arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09457 2026-05-12 cs.LG cs.AI cs.SI

RAwR: Role-Aware Rewiring via Approximate Equitable Partition

Riccardo Porcedda, Giuseppe Squillace, Bastian Epping, Andrea Vandin, Michael Schaub, Mirco Tribastone, Francesca Chiaromonte

Affiliations: Department of Excellence L’EMbeDS, Sant’Anna School of Advanced Studies, Pisa, Italy; Department of Computer Science, University of Pisa, Italy; Department of Statistics, The Pennsylvania State University, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, USA; IMT School for Advanced Studies, Lucca, Italy; Computational Network Science, RWTH Aachen University, Aachen, Germany; DTU Technical University of Denmark, Lyngby, Denmark

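AI summary: This paper proposes RAwR, a rewiring framework for graph neural networks (GNNs) that targets the performance drop GNNs suffer on prediction tasks that depend on long-range interactions. The method augments the input graph with a quotient graph derived from an approximate equitable partition, which speeds up information propagation between nodes that share the same structural role and thereby lowers the system's effective resistance. Experiments show that RAwR achieves state-of-the-art performance on a range of benchmark datasets, and a theoretical analysis yields the Spectral Role Lift (SRL) metric for optimizing the rewiring.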

Abstract

While Graph Neural Networks (GNNs) have demonstrated significant efficacy in node classification tasks, where predictions rely on local neighborhood information, the performance of GNNs often drops when prediction tasks depend on long-range interactions. These limitations are attributed to phenomena such as oversquashing, where structural bottlenecks restrict signal propagation across the network topology. To address this challenge, we introduce RAwR, a computationally efficient rewiring framework that augments the input graph with a quotient graph derived from equitable partitions. This approach facilitates accelerated communication between nodes that share identical structural roles, as identified by the Weisfeiler-Leman graph coloring, and thereby reduces the total effective resistance of the system. Furthermore, by employing an approximate definition of the equitable partition, RAwR enables a controllable reduction of the quotient graph, which, in its most condensed state, recovers the conventional Master Node rewiring technique. Empirical evaluations across a diverse suite of benchmarks -- including homophilic, heterophilic, and synthetic long-range datasets -- demonstrate that RAwR achieves state-of-the-art results. Our contribution is further supported by an analytical investigation using a teacher-student model of linear GNNs, which elucidates the theoretical foundations of role-based rewiring. This analysis leads to the formulation of Spectral Role Lift (SRL), a metric designed to identify the optimal approximate equitable partition for maximizing predictive performance.
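
The role-discovery step in this abstract can be illustrated with plain color refinement. The sketch below is not the authors' implementation: it computes the coarsest equitable partition of a small undirected graph by 1-dimensional Weisfeiler-Leman refinement, then adds one hypothetical "role node" per class, linked to its members and to the role nodes of adjacent classes. The function names and the way the quotient graph is attached are assumptions made for illustration.

```python
def equitable_partition(adj):
    """Coarsest equitable partition via 1-WL color refinement.

    adj: dict mapping node -> set of neighbor nodes (undirected graph).
    Returns a dict mapping node -> integer color (role id).
    """
    colors = {v: 0 for v in adj}  # start with a single class
    while True:
        # Signature = own color plus the sorted multiset of neighbor colors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: palette[sigs[v]] for v in adj}
        # Refinement only splits classes, so an unchanged count means stable.
        if len(set(new_colors.values())) == len(set(colors.values())):
            return new_colors
        colors = new_colors


def rewire_with_roles(adj, colors):
    """Copy adj and add one ("role", c) node per class c (hypothetical scheme).

    Each role node is linked to its member nodes and to the role nodes of
    classes that are adjacent in the original graph.
    """
    new_adj = {v: set(nbrs) for v, nbrs in adj.items()}
    for c in set(colors.values()):
        new_adj[("role", c)] = set()
    for v in adj:
        r = ("role", colors[v])
        new_adj[v].add(r)
        new_adj[r].add(v)
        for u in adj[v]:
            q = ("role", colors[u])
            if q != r:
                new_adj[r].add(q)
                new_adj[q].add(r)
    return new_adj


# Path graph 0-1-2-3-4: the two ends share a role, as do nodes 1 and 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
roles = equitable_partition(path)
rewired = rewire_with_roles(path, roles)
print(roles)
print(sorted(rewired[0], key=str))
```

In the rewired path, nodes 0 and 4 are two hops apart through their shared role node instead of four. If every node fell into a single class, the scheme would add one node connected to all others, which corresponds to the Master Node case the abstract mentions.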

2605.09455 2026-05-12 cs.CV

Adaptive 3D Convolution for Remote Sensing Image Fusion

Siran Peng, Xiangyu Zhu, Shang-Qi Deng, Liang-Jian Deng, Zhen Lei

Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Mathematical Sciences/Multi-Hazard Early Warning Key Laboratory of Sichuan Province, University of Electronic Science and Technology of China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences

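AI summary: This paper studies remote sensing image fusion, which aims to produce a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with rich spectral data. To address the shortcomings of existing methods in spectral preservation and computational efficiency, the authors propose Adaptive 3D Convolution (Ada3D), which generates a unique set of 3D kernels for each input voxel, combines spatial and spectral information to improve fusion quality, and uses group convolution to reduce computational complexity. Experiments show state-of-the-art performance on five datasets.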

Comments Accepted by IEEE Transactions on Image Processing (TIP), Early Access, 2026

Abstract

Remote sensing image fusion aims to create a high-resolution multi/hyper-spectral image from a high-resolution image with limited spectral information and a low-resolution image with abundant spectral data. Recently, deep learning (DL) techniques have shown significant effectiveness in this area. Most DL-based methods approach image fusion as a 2D problem by encoding spectral information into feature map channels. However, our research suggests that this strategy introduces notable spectral distortions. In contrast, some methods consider spectral data as an additional dimension, utilizing standard 3D convolutions to preserve spectral information. Nevertheless, in a standard 3D convolutional layer, the same set of kernels is applied across all input regions, which we have found to be sub-optimal for image fusion. Furthermore, standard 3D convolutions necessitate substantial computational resources. To address these challenges, we propose a novel convolutional paradigm called Adaptive 3D Convolution (Ada3D) for remote sensing image fusion. Ada3D applies a unique set of 3D kernels to each input voxel, enabling the capture of fine-grained details. These adaptive kernels are generated through a two-step process: (i) spatial and spectral kernels are derived from their respective image sources; (ii) these two types of kernels are then combined to form content-aware 3D kernels that effectively integrate spatial and spectral information. Additionally, adaptive biases are introduced to enhance the convolutional outcome at the voxel level. Furthermore, we incorporate the group convolution technique to reduce computational complexity. As a result, Ada3D offers full adaptivity in an efficient manner. Evaluation results across five datasets demonstrate that our method achieves SOTA performance, underscoring the superiority of Ada3D. The code is available at https://github.com/PSRben/Ada3D.

2605.09449 2026-05-12 cs.CV

SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs

Bo Gu, Zhikang Zhang, Zizhuang Wei, Zhenyuan Chen, Lingyun Li, Zhuoyi Song

Affiliations: Fudan University; Huawei; Shenzhen Loop Area Institute

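AI summary: Multimodal large language models (MLLMs) have made notable progress in visual understanding and language reasoning, but they lack a persistent, world-centered spatial representation in 3D environments. The paper proposes SpaceMind++, a video MLLM architecture that builds a voxelized cognitive map from RGB video to preserve object permanence and spatial topology. It introduces Coordinate-Guided Deep Iterative Fusion, which feeds map-level spatial knowledge back into the original 2D visual features and strengthens spatial reasoning without disrupting the model's existing visual interface. Experiments show strong results on several benchmarks, with better generalization in unseen 3D environments in particular.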

Comments 14 pages, 3 figures

Abstract

Recent multimodal large language models (MLLMs) have made remarkable progress in visual understanding and language-based reasoning, yet they lack a persistent world-centered representation for spatially consistent reasoning in 3D environments. Inspired by the mammalian dual-stream system, where semantic and spatial cues are processed separately and integrated into an allocentric cognitive map, we propose SpaceMind++, a video MLLM architecture that explicitly builds a voxelized cognitive map from RGB videos. This map reorganizes fragmented egocentric observations into a shared 3D metric representation, enabling the model to preserve object permanence and spatial topology across changing viewpoints. To make this allocentric representation usable by a pretrained video MLLM without disrupting its native visual-token interface, we introduce Coordinate-Guided Deep Iterative Fusion, a new mechanism that relays map-level spatial knowledge back into the original 2D visual features. This fusion is explicitly guided by coordinate embeddings and 3D Rotary Positional Encoding, which ground semantic interactions in metric 3D space, resembling the entorhinal binding of sensory features to metric space. Extensive experiments show that SpaceMind++ achieves new state-of-the-art performance on VSI-Bench. Furthermore, it demonstrates superior out-of-distribution generalization on SPBench, SITE-Bench, and SPAR-Bench, underscoring its robustness in unseen 3D environments.

2605.09448 2026-05-12 cs.LG

Learning to Bid with Unknown Private Values in Budget-Constrained First-Price Auctions

Zihao Hu, Yuxiao Wen, Yuan Yao, Jiheng Zhang, Zhengyuan Zhou

Affiliations: Department of Mathematics, The Hong Kong University of Science and Technology; Stern School of Business, New York University

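AI summary: This paper studies learning to bid in first-price auctions under budget constraints and return-on-spend targets, where the bidder's valuations are not directly available and must be inferred from censored data. The authors propose a unified primal-dual framework that jointly learns the latent valuation parameters and the competitors' bid distribution. By using a strong Slater condition and an adaptive burn-in procedure, the method keeps the dual variables stable and achieves near-optimal regret bounds, giving the first theoretically grounded solution for constrained bidding with latent valuations.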

Abstract

The transition to First-Price Auctions (FPA) in digital advertising has spurred significant research, yet existing work typically assumes access to a valuation oracle, ignoring the reality that values must be inferred from censored data. While Linear Treatment Effect (LTE) models address this by learning value uplift, they have not been adapted to realistic settings with hard Budget constraints or Return-on-Spend (RoS) targets requiring regret and violation control. In this work, we propose a unified primal-dual framework for constrained FPAs that jointly learns the latent LTE valuation parameters and the competitor's bid distribution. This simultaneous learning introduces a critical technical challenge: the estimation error is dynamically scaled by the Lagrangian multiplier, potentially leading to unbounded regret. We resolve this by leveraging a strong Slater condition and a novel adaptive burn-in procedure to stabilize the dual variables. Our approach achieves near-optimal regret guarantees, providing the first theoretically grounded solution for constrained bidding with latent valuations.

2605.09443 2026-05-12 cs.CV cs.CL

Through the Lens of Character: Resolving Modality-Role Interference in Multimodal Role-Playing Agent

Yihong Tang, Kehai Chen, Xuefeng Bai, Min Zhang

Affiliations: Harbin Institute of Technology; Shenzhen Loop Area Institute (SLAI)

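AI summary: As multimodal large language models advance, Role-Playing Agents (RPAs) are moving into visually grounded environments, but the generic visual features that existing models extract tend to drown out character traits, causing Modality-Role Interference (MRI). The paper proposes CAVI, a training-free Character-Aware Visual Intervention framework that mitigates MRI through Character-Guided Token Pruning, Orthogonal Feature Modulation, and Modality-Adaptive Role Steering, and significantly improves character-consistent multimodal interaction.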

Abstract

The advancement of Multimodal Large Language Models (MLLMs) has expanded Role-Playing Agents (RPAs) into visually grounded environments. However, human vision is inherently subjective and identity-driven, whereas existing MLLMs extract objective, character-agnostic features for general tasks. In RPAs, this generic visual noise overpowers fragile character traits, causing Modality-Role Interference (MRI), where agents struggle to integrate visual grounding and character consistency. To address this, we introduce the training-free Character-Aware Visual Intervention (CAVI) framework, enabling agents to perceive the world through the lens of character. CAVI systematically targets MRI: macroscopically, Character-Guided Token Pruning (CTP) restricts the visual receptive field to role-relevant entities; microscopically, Orthogonal Feature Modulation (OFM) projects tokens onto a character-context subspace to extract aligned facts; and during decoding, Modality-Adaptive Role Steering (MARS) dynamically optimizes steering intensity based on visual reliance. Extensive experiments show CAVI effectively alleviates MRI, significantly enhancing character-consistent multimodal interactions.

2605.09442 2026-05-12 cs.CV cs.AI

SWIFT: Prompt-Adaptive Memory for Efficient Interactive Long Video Generation

Shanwen Tan, Hao Li, Jingtao Zhang, Xiaosong Jia, Xue Yang, Shaofeng Zhang, Yanyong Zhang

Affiliations: University of Science and Technology of China; Fudan University; Georgia Institute of Technology; Shanghai Jiao Tong University

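AI summary: SWIFT is an efficient framework for multi-prompt long-video generation that addresses the tension between semantic coherence and computational efficiency under continuous semantic switching. It introduces a lightweight Semantic Injection Cache and an Adaptive Dynamic Window, which allow efficient semantic switching without rebuilding cached content, and it maintains temporal consistency through head-wise semantic injection and segment-level semantic anchors. Experiments show that SWIFT reaches 22.6 FPS on a single H100 GPU, substantially improving the efficiency of long-video generation.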

Comments Code is available at https://github.com/ShanwenTan/SWIFT

Abstract

Streaming long-video generation faces a central challenge in continuous semantic switching, requiring adaptive memory to preserve coherent visual evolution. Current approaches rely on cache rebuilding at prompt boundaries or fixed memory budgets, but they introduce redundant computation and limit flexible semantic adaptation. This limitation arises from a mismatch between cached video history and prompt updates, as memory preserves visual continuity while prompt switches demand rapid semantic adaptation. Motivated by this observation, we present SWIFT, Semantic Windowing and Injection for Flexible Transitions, a training-free framework for multi-prompt long-video generation that enables efficient semantic switching while preserving temporal coherence in causal video diffusion models. SWIFT introduces a lightweight Semantic Injection Cache that augments cached video memory rather than reconstructing it from scratch at every prompt boundary. To avoid uniformly perturbing all attention channels, we further perform head-wise semantic injection, so that each attention head receives a prompt update proportional to its alignment with the current video state. In addition, we introduce an Adaptive Dynamic Window that allocates temporal memory according to prompt phase, using larger local context near switching boundaries and smaller windows during stable segments to reduce average inference cost. To preserve long-range semantic consistency under compressed local attention, we further maintain segment-level semantic anchors that summarize prompt-conditioned video history and reintroduce it as compact memory tokens. Compared with current state-of-the-art methods, SWIFT preserves generation quality while achieving 22.6 FPS on a single H100 GPU, establishing a substantially more efficient solution for multi-prompt long-video generation. Our code is available at https://github.com/ShanwenTan/SWIFT.

2605.09441 2026-05-12 cs.RO

Beyond Isolation: A Unified Benchmark for General-Purpose Navigation

Samson Sun, Tianyi Yang, Tengyue Wang, Yikai Xue, Zhengjie Xu, Lingming Zhang, Qichen Zhang, Chao Liang, Zhipeng Zhang

Affiliations: AutoLab, SAI, Shanghai Jiao Tong University; Research Lab, Anyverse Dynamics

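AI summary: Progress on general-purpose navigation agents is held back by fragmented evaluation protocols that test navigation skills in isolation and focus on specific robot morphologies, so they fail to reflect the coordination of diverse behaviors that real-world scenarios require. The paper proposes the OmniNavBench benchmark to evaluate cross-skill coordination and cross-embodiment generalization. The benchmark introduces composite task instructions, support for multiple robot morphologies, and high-quality human demonstrations, with the goal of improving agent performance in complex, interleaved task scenarios. It exposes the shortcomings of existing methods on general-purpose navigation and provides a new testbed for the next generation of generalist navigation systems.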

Comments Accepted at RSS 2026

Abstract

The pursuit of general-purpose embodied agents is hindered by fragmented evaluation protocols that isolate navigation skills and fixate on specific robot morphologies, failing to reflect real-world scenarios where agents must orchestrate diverse behaviors across varying embodiments. To bridge this gap, we introduce OmniNavBench, a benchmark for cross-skill coordination and cross-embodiment generalization. OmniNavBench introduces three paradigm shifts: (1) Compositional Complexity. We propose composite instructions that interleave sub-tasks from 6 categories (PointNav, VLN, ObjectNav, SocialNav, Human Following and EQA), compelling agents to transition between exploration, interaction, and social compliance within a single episode. (2) Morphological Universality and Sensor Flexibility. We present a simulation platform that breaks the reliance on single-morphology evaluation, enabling generalization tests across humanoid, quadrupedal, and wheeled robots, with a modular sensor interface and 170 environments blending synthetic assets with real-world scans. (3) Demonstrations Quality. Moving beyond shortest-path algorithms, we curate 1779 expert trajectories via human teleoperation, capturing behavioral nuances such as exploratory glance and anticipatory avoidance. Extensive evaluations demonstrate that current methods, despite their claimed unified design, struggle with the complex, interleaved nature of general-purpose navigation. This exposes a critical disparity between existing capabilities and real-world deployment demands, underscoring OmniNavBench as a testbed for the next generation of generalist navigators. Dataset, code, and leaderboard are available at http://omninavbench.cloud-ip.cc.

2605.09440 2026-05-12 cs.CL cs.AI

Key Coverage Matters: Semi-Structured Extraction of OCR Clinical Reports

Yu Wang, Yingyun Li, Ying Qin, Haiyang Qian

Affiliations: AI Starfish

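AI summary: Clinical reports are often scattered across healthcare institutions because of privacy policies and data silos, which makes electronic health record integration and longitudinal tracking difficult. The paper proposes a semi-structured information extraction method based on key fields: it builds a key inventory through iterative mining, normalization, and clustering, and introduces "key coverage" to measure information completeness. Experiments show that model performance improves markedly as key coverage increases. With the Top-90 keys covered, F1 reaches 0.839 under exact match and 0.893 under boundary-tolerant matching, and the method is applicable to multilingual settings.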

Comments Preprint. Under review at MLHC 2026

Abstract

Clinical reports are often fragmented across healthcare institutions because privacy regulations and data silos limit direct information sharing. When patients seek care at a different hospital, they often carry paper or scanned reports from prior visits. This hinders EHR integration and longitudinal review, and downstream applications that depend on more complete patient records, such as patient management, follow-up care, real-world studies, and clinical-trial matching. Although OCR can digitize such reports, reliable extraction remains challenging because clinical documents are heterogeneous, OCR text is noisy, and many healthcare settings require low-cost on-premise deployment. We formulate this problem as canonical key-conditioned extractive question answering over OCR-derived clinical reports. Because the key fields are neither fixed nor known in advance, the key space is open. We maintain a canonical key inventory through iterative key mining, normalization, clustering, and lightweight human verification, and introduce key coverage as a metric to quantify inventory completeness. Using a 0.2B BERT-based model, experiments on real-world reports from more than 20 hospitals show performance improves monotonically with key coverage. The model achieves F1 scores of 0.839 and 0.893 under exact match and boundary-tolerant matching, respectively, once the Top-90 canonical keys are covered. These results show that key coverage is a dominant factor for end-to-end performance. At Top-90 coverage, our model outperforms a fine-tuned Qwen3-0.6B baseline under exact match. Although our annotated corpus is Chinese, the method relies on the language-agnostic key-value organization of semi-structured clinical reports and can be adapted to other settings given an appropriate canonical key inventory and alias mapping.
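
The abstract does not give a formula for key coverage. One plausible reading, sketched below as an assumption rather than the paper's definition, is the share of key mentions in a corpus that map to one of the Top-K canonical keys. The function name, the `alias_to_canonical` mapping, and the toy keys are all hypothetical.

```python
from collections import Counter


def key_coverage(observed_keys, alias_to_canonical, top_k):
    """Fraction of observed key mentions covered by the top_k most frequent
    canonical keys. Mentions with no alias mapping count as uncovered."""
    canonical = [alias_to_canonical.get(k) for k in observed_keys]
    counts = Counter(c for c in canonical if c is not None)
    inventory = {key for key, _ in counts.most_common(top_k)}
    covered = sum(1 for c in canonical if c in inventory)
    return covered / len(observed_keys) if observed_keys else 0.0


# Hypothetical alias map and key mentions taken from OCR text.
aliases = {
    "WBC": "white_blood_cell",
    "White blood cells": "white_blood_cell",
    "HGB": "hemoglobin",
    "PLT": "platelet",
}
mentions = ["WBC", "White blood cells", "HGB", "PLT", "Unknown item"]

print(key_coverage(mentions, aliases, top_k=1))
print(key_coverage(mentions, aliases, top_k=3))
```

Under this reading, coverage grows as more canonical keys enter the inventory and stays below 1 while some surface forms have no alias mapping.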

2605.09439 2026-05-12 cs.LG stat.ML

Inverse Design for Conditional Distribution Matching

Ori Meidler, Shaul Tolkovsky, Or Zuk

Affiliations: Department of Statistics and Data Science

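AI summary: The paper introduces a new inverse-design problem, Conditional Distribution Matching (CDM): given a joint distribution $\mathcal{P}(X, Y)$, find an input $x^*$ whose induced conditional distribution $\mathcal{P}(Y \mid X = x^*)$ matches a target distribution $\mathcal{G}(Y)$. To solve it, the authors propose the MLGD-F algorithm, which combines a pretrained diffusion model with a fast conditional sampler and runs without additional training. Experiments show that the method reliably recovers inputs that meet user-specified distributional targets across a variety of tasks.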

Abstract

Generative models are powerful tools for sampling from a learned distribution $\mathcal{P}(Y \mid X)$, and inverse-design methods invert this map to find an input $x$ that produces a desired point output $y^*$. However, many design goals are naturally distributional rather than pointwise, incorporating the inherent uncertainty of $Y$ and targeting a specific form for it, a task not addressed by standard inverse design. To address this issue we introduce Conditional Distribution Matching (CDM), a new inverse-design problem class in generative modeling: given a joint distribution $\mathcal{P}(X, Y)$ and a target distribution $\mathcal{G}(Y)$, find an input $x^*$ whose induced conditional distribution $\mathcal{P}(Y \mid X = x^*)$ matches $\mathcal{G}$. We formally define two variants: Conditional Distribution Matching Sampling (CDMS) and Conditional Distribution Matching Optimization (CDMO). To solve these problems, we propose MLGD-F (Matching-Loss Guided Diffusion with a Fast inner sampler), a plug-and-play inference-time algorithm that combines a pretrained score-based diffusion model with a pretrained fast conditional sampler, requiring no additional training or fine-tuning. By leveraging single-step conditional sampling, MLGD-F enables tractable gradient computation, making the estimation of $\mathcal{P}(Y \mid X)$ both memory-efficient and computationally lightweight. We validate MLGD-F on synthetic benchmarks, structured image transformations, and generative editing optimization, demonstrating reliable recovery of inputs whose conditional distributions match diverse user-specified targets, including discrete mixtures and continuous low-rank supports.
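
Written as an objective, the optimization variant might look like the following. This is a sketch under assumptions: the abstract does not name the discrepancy, so $D$ stands for any divergence or distance between distributions over $Y$, and the sample-based estimate $\hat{D}$ with $n$ draws is introduced here only for illustration.

```latex
% CDMO as an optimization over inputs (D is an unspecified discrepancy).
x^* \in \arg\min_{x} \; D\big(\mathcal{P}(Y \mid X = x),\ \mathcal{G}(Y)\big)

% The conditional is only accessible through samples, so the loss would be
% estimated from draws of a conditional sampler and draws from the target:
y_1, \dots, y_n \sim \mathcal{P}(Y \mid X = x), \qquad
\tilde{y}_1, \dots, \tilde{y}_n \sim \mathcal{G}(Y), \qquad
\hat{D}(x) = \hat{D}\big(\{y_i\}_{i=1}^{n},\ \{\tilde{y}_j\}_{j=1}^{n}\big)
```

If each $y_i$ comes from a single differentiable sampling step, the gradient of $\hat{D}(x)$ with respect to $x$ can be taken through the sampler. That is one way to read the abstract's remark that single-step conditional sampling makes gradient computation tractable.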

2605.09438 2026-05-12 cs.LG

fmxcoders: Factorized Masked Crosscoders for Cross-Layer Feature Discovery

Andreas D. Demou, Panagiotis Koromilas, James Oldfield, Yannis Panagakis, Mihalis A. Nicolaou

Affiliations: The Cyprus Institute; University of Athens; University of Oxford; Archimedes AI/Athena Research Center; University of Cyprus

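AI summary: The study addresses cross-layer feature extraction in pretrained Transformers and proposes fmxcoders, a new method for recovering cross-layer features in a shared latent space more effectively. Standard crosscoders are limited in their cross-layer parameterization and cross-layer dependence, so the latents they learn mostly capture surface patterns rather than deeper semantic concepts. By introducing low-rank tensor factorization and stochastic layer masking, fmxcoders improve the cross-layer coherence and semantic interpretability of latents and significantly improve probing and reconstruction performance on several base models.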

Abstract

Many features in pretrained Transformers span multiple layers: they emerge through stages of inference, persist in the residual stream, or are built jointly by parallel MLPs. Crosscoders (namely, sparse dictionaries trained jointly across layers) aim to recover these cross-layer features in a single shared latent space. We show that standard crosscoders largely fail at this purpose. Although their decoder weight norms spread evenly across layers, a functional coherence metric we introduce reveals that each latent's activation is effectively driven by only one or two layers on average. While functionally coherent latents act as human-interpretable concept detectors (e.g., US states and cities), the layer-localized latents that crosscoders predominantly learn collapse onto surface-level patterns such as digit detectors. We trace this failure to two structural limitations: unconstrained cross-layer parameterization and unregularized cross-layer dependence. We address both by introducing fmxcoders, which (i) replace the encoder and decoder with low-rank tensor factorizations that draw every latent's per-layer weights from a shared cross-layer basis, and (ii) apply stochastic layer masking, a denoising regularizer along the layer axis that penalizes latents whose contribution collapses when a single layer is masked. Across GPT2-Small, Pythia-410M, Pythia-1.4B, and Gemma2-2B, fmxcoders lift mean probing F1 by 10-30 points, surpassing per-layer SAE baselines that standard crosscoders fail to reach, reduce reconstruction MSE by 25-50%, and roughly double mean functional coherence. An LLM-as-a-judge evaluation further shows that fmxcoders recover 3-13$\times$ more semantically coherent latents than standard crosscoders across all four base LLMs.

2605.09433 2026-05-12 cs.CV

Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

Yunhong Lu, Qichao Wang, Hengyuan Cao, Xiaoyin Xu, Min Zhang

Affiliations: Zhejiang University; Shanghai Institute for Advanced Study-Zhejiang University; Shanghai Institute for Mathematics and Interdisciplinary Sciences

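AI summary: Existing preference datasets for text-to-image models usually store only the final winner or loser images, which is insufficient for rectified flow (RF) models, whose generation depends on a specific prior noise sample and follows a nearly straight denoising trajectory. The paper proposes Prior Noise-Aware Preference Optimization (PNAPO), an offline preference optimization framework for rectified flow models. It retains the paired prior noises used to generate the winner/loser images, extending the standard triplet to a sextuple, and uses the straight-line property of RF for noise-image interpolation, which gives more accurate trajectory estimates and a tighter optimization objective. Experiments show that PNAPO significantly improves preference metrics on mainstream RF text-to-image models while reducing training compute.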

Comments Accepted by ICML 2026

Abstract

Existing preference datasets for text-to-image models typically store only the final winner/loser images. This representation is insufficient for rectified flow (RF) models, whose generation is naturally indexed by a specific prior noise sample and follows a nearly straight denoising trajectory. In contrast, prior DPO-style alignment for diffusion models commonly estimates trajectories using an independent forward noising process, which can be mismatched to the true reverse dynamics and introduces unnecessary variance. We propose Prior Noise-Aware Preference Optimization (PNAPO), an off-policy alignment framework specialized for rectified flow. PNAPO augments preference data by retaining the paired prior noises used to generate each winner/loser image, turning the standard (prompt, winner, loser) triplet into a sextuple. Leveraging the straight-line property of RF, we estimate intermediate states via noise-image interpolation, which constrains the trajectory estimation space and yields a tighter surrogate objective for preference optimization. In addition, we introduce a dynamic regularization strategy that adapts the DPO regularization based on (i) the reward gap between winner and loser and (ii) training progress, improving stability and sample efficiency. Experiments on state-of-the-art RF T2I backbones show that PNAPO consistently improves preference metrics while substantially reducing training compute.
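
The noise-image interpolation can be sketched in a few lines. The example assumes the common rectified-flow convention that t=0 is the prior noise and t=1 is the image; the paper may use the opposite time direction. The record layout and field names are hypothetical, and the vectors are toy stand-ins for image latents.

```python
def interpolate(noise, image, t):
    """Point at time t on the straight line from noise (t=0) to image (t=1)."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, image)]


# Hypothetical preference record that keeps the noise behind each image.
record = {
    "prompt": "a red bicycle",
    "winner": [1.0, 0.5],
    "winner_noise": [0.0, -0.5],
    "loser": [0.2, 0.9],
    "loser_noise": [0.4, 0.1],
}

# Intermediate states lie on the segment joining each tracked pair,
# instead of being built from a freshly sampled, unrelated noise.
t = 0.5
winner_state = interpolate(record["winner_noise"], record["winner"], t)
loser_state = interpolate(record["loser_noise"], record["loser"], t)
print(winner_state)
print(loser_state)
```

Because the noise is fixed per image, the only free quantity in the intermediate state is the time t, which is the sense in which the tracked pair constrains the trajectory estimate.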

2605.09431 2026-05-12 cs.CL

PumpSense: Real-Time Detection and Target Extraction of Crypto Pump-and-Dumps on Telegram

Ahmed Mahrous, Roberto Di Pietro

Affiliations: King Abdullah University of Science and Technology (KAUST)

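AI summary: PumpSense is a study of real-time detection and target extraction for cryptocurrency pump-and-dump schemes on Telegram. It builds an annotated dataset of more than 280,000 messages for identifying pump announcements and their target coins and exchanges, and proposes detection and extraction methods based on machine learning and large language models that achieve near-real-time detection. The study also establishes the first benchmark for the extraction task and shows that LLM-based methods have a clear advantage in target extraction.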

Comments Accepted to the 2026 IEEE International Conference on Blockchain and Cryptocurrency (ICBC)

Abstract

Cryptocurrency pump-and-dump schemes coordinated via Telegram threaten market integrity. However, existing research addressing this specific threat has not yet produced solutions that combine reliable results with fast response. This is in part due to the absence of publicly available, message-level labeled data, as well as design choices. In this paper, we address both issues. In particular, we introduce a corpus of over 280,000 Telegram posts from 39 pump-organizing groups, all manually reviewed to identify 2,246 pump announcements and their targeted cryptocurrency and exchange. Leveraging this dataset, we define two tasks: real-time pump-announcement detection and target cryptocurrency/exchange extraction. For detection, we compare two machine-learning models: a lightweight tree-based LightGBM classifier (F1=0.79, latency=9.4 µs/sample) and a transformer-based BGE-M3 (F1=0.83, latency=50 ms/sample). With our proposed approach, we show that message analysis can achieve near-instant pump detection at the level of individual Telegram message windows. Unlike prior work that relies purely on market data and typically detects pumps tens of seconds after abnormal trading activity is observed, our method operates directly on the coordination messages themselves and can be evaluated in microseconds per window on commodity hardware. To our knowledge, we also establish the first benchmark for manipulated coin and exchange extraction. We demonstrate that traditional rule-based extraction methods, widely relied upon in prior literature, are ineffective due to ticker ambiguity. In contrast, LLMs achieve the highest accuracy with a score of 0.91.
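
The ticker-ambiguity problem is easy to reproduce. The snippet below is an illustrative rule-based extractor, not one taken from the paper or from prior work; the watch list and the messages are made up. A dictionary lookup over uppercase tokens also picks up ordinary words that happen to be tickers.

```python
import re

# Hypothetical watch list; several tickers are also common English words.
KNOWN_TICKERS = {"BTC", "ETH", "ONE", "GAS", "HOT"}


def extract_tickers(message):
    """Naive rule: any all-uppercase token of 2 to 5 letters on the list."""
    tokens = re.findall(r"\b[A-Z]{2,5}\b", message)
    return [t for t in tokens if t in KNOWN_TICKERS]


announcement = "Coin is ONE. Buy on Binance now!"
countdown = "ONE minute left, GAS fees are low, target revealed soon"

print(extract_tickers(announcement))
print(extract_tickers(countdown))
```

The rule finds the target in the first message, but it also returns two "tickers" for the countdown message, which names no target at all. Telling these cases apart requires reading the sentence, which a lookup rule cannot do.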

2605.09429 2026-05-12 cs.CV cs.AI

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun

Affiliations: Xiamen University

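AI summary: The study asks whether low-attention visual tokens in vision-language models are truly redundant, and argues that existing methods that prune by shallow attention scores can harm reasoning over complex scenes, causing "Visual Aphasia". The authors propose COAST, a training-free framework for contrastive adaptive semantic token pruning that uses cross-modal attention to identify key tokens and balances semantic evidence against spatial context. Experiments show that COAST sharply reduces the number of visual tokens and speeds up inference on multiple benchmarks while retaining high model performance, and that it applies broadly across models and compression settings.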

Abstract

Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the retention trade-off between semantic evidence and spatial context. It further uses a contrastive routing score to preserve both anchor-aligned evidence and complementary spatial context. Across seven benchmarks, COAST reduces visual tokens by 77.8% and achieves a 2.15x latency speedup while retaining 98.64% of the original average performance. Beyond a single backbone or compression setting, COAST consistently outperforms strong pruning baselines across token budgets and generalizes across multiple LVLM families, showing that adaptive semantic routing is a robust alternative to one-shot scalar pruning.
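
Attention entropy as a dispersion signal can be sketched as follows. This is an illustration, not COAST's implementation: the mapping from entropy to a budget split (`context_share`) is a hypothetical rule introduced here, and the attention weights are toy values.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention distribution over tokens."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return sum(-p * math.log(p) for p in probs)


def context_share(weights):
    """Hypothetical rule: normalized entropy in [0, 1] decides how much of the
    token budget goes to spatial context rather than anchor-aligned tokens."""
    n = len(weights)
    return attention_entropy(weights) / math.log(n) if n > 1 else 0.0


focused = [0.97, 0.01, 0.01, 0.01]   # query attends to one region
diffuse = [0.25, 0.25, 0.25, 0.25]   # query attends everywhere

print(context_share(focused))
print(context_share(diffuse))
```

A focused query yields low entropy, so most of the budget would stay with the anchor region. A diffuse query yields entropy near its maximum, so more of the budget would be kept for surrounding context.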

2605.09428 2026-05-12 cs.LG

FedCIGAR: A Personalized Reconstruction Approach for Federated Graph-level Anomaly Detection

Yunfeng Zhao, Yixin Liu, Qingfeng Chen, Shiyuan Li, Yue Tan, Shirui Pan

Affiliations: Guangxi University; Griffith University

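AI summary: This paper proposes FedCIGAR, a federated graph-level anomaly detection method that aims to reconcile privacy protection with model generalization in distributed settings. The method learns by reconstruction on normal graphs, avoiding unrealistic synthetic anomalies, and introduces a client-side node contribution gating mechanism and a server-side sliding-window clustering strategy to handle data heterogeneity. Experiments show that FedCIGAR achieves superior detection performance and robustness on several benchmark datasets.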

Comments Accepted by IJCAI 2026

Abstract

Graph-level anomaly detection (GLAD) is crucial for ensuring the reliability of graph-driven applications by identifying abnormal graphs that deviate from the majority. Considering the privacy concerns in distributed scenarios, federated graph-level anomaly detection (FedGLAD) has emerged as a promising solution to enable collaborative detection without sharing raw data. However, existing methods suffer from poor generalization due to the reliance on unrealistic synthetic anomalies and insufficient personalization capabilities under data heterogeneity. To address these challenges, we propose a novel Federated graph-level anomaly detection approach with Cluster-adaptIve GAted Reconstruction (FedCIGAR). Specifically, we design a reconstruction-based paradigm trained on normal graphs to avoid synthetic data. Furthermore, we introduce a client-side node contribution gating mechanism and a server-side sliding window-based clustering strategy to tackle data heterogeneity. Extensive experiments demonstrate that FedCIGAR achieves superior performance and robustness in contrast to state-of-the-art methods.

2605.09425 2026-05-12 cs.CV cs.AI

AtteConDA: Attention-Based Conflict Suppression in Multi-Condition Diffusion Models and Synthetic Data Augmentation

Shogo Noguchi

Affiliations: Gunma University

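AI summary: This paper studies how conflicts among conditions affect the structural fidelity of images generated by multi-condition diffusion models and proposes an attention-based conflict suppression method that improves the high-level structural consistency of generated images. Using semantic segmentation, depth maps, and edges as the multiple conditions, the model generates high-quality images that preserve scene details, for data augmentation in autonomous-driving tasks. Beyond addressing condition conflicts, the work builds a generation framework and evaluation protocol for driving tasks, supporting efforts to mitigate data scarcity in high-level autonomous driving.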

Comments 44 pages, 20 figures. Code and project page available at: https://github.com/ShogoNoguchi/AtteConDA

Abstract

Recent conditional image generation methods can improve controllability by generating images that are faithful to conditions such as sketches, human poses, segmentation maps, and depth. By applying these techniques to image augmentation while preserving annotations, generated images can be used as additional training data and can improve recognition performance. However, for high-level driving tasks such as traffic-rule extraction and driving-behavior understanding, simply using annotations as conditions is insufficient. Instead, images must be augmented while preserving the detailed high-level structure of the original scene. One possible solution is to use multiple conditions so that generated images retain diverse structural cues after generation. However, when multiple conditions are used, conflicts among conditions can prevent reliable structure preservation. In this work, we input semantic segmentation, depth, and edges extracted from the original image into a multi-condition image generation model, thereby providing rich structural information as conditions. We further propose a modeling approach for handling conflicts among multiple conditions and show that it enables image generation with stronger structural preservation. We also build a generation framework and evaluation protocol for driving tasks, establishing a basis for comparison with prior and future models. As a result, this work contributes to image generation research by addressing condition conflicts in multi-condition generation and provides an important step toward mitigating data scarcity in high-level autonomous-driving tasks.

2605.09424 2026-05-12 cs.LG

Tabular Foundation Model for Generative Modelling

Xiangjian Jiang, Mingxuan Liu, Nikola Simidjievski, Tassilo Klein, Mateja Jamnik

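Affiliations * Department of Computer Science and Technology, University of Cambridge, UK; SAP SE; Télécom Paris, Institut Polytechnique de Paris, France

AI summary This paper proposes TabFORGE, a new tabular foundation model for generation, aimed at the shortfall in synthetic data quality of existing tabular generative models. Using a pretrained causality-aware feature encoder, the model learns the implicit causal information of tabular data in a unified latent space. It adopts a two-stage design that first pretrains a score-based diffusion transformer and then pretrains a denoising-aligned decoder, which mitigates the distribution shift in latent representations between training and inference. Experiments show that TabFORGE efficiently generates high-quality synthetic tabular data, with particularly strong structural fidelity.

Details
English abstract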

Generative modelling is a demanding test of foundation models, because it requires robust, holistic representation learning for a given data modality, rather than optimisation for a supervised prediction target alone. While recent work on tabular foundation models has achieved remarkable progress in predictive modelling, generative tabular foundation models remain underexplored. Existing tabular foundation generators, in particular, have not yet consistently matched strong dataset-specific generators in synthetic data quality. A key reason is their misalignment with the distinctive causal structural prior of heterogeneous tabular data. In this paper, we address this gap by introducing a novel tabular foundation model, TabFORGE, built on pretrained Tabular FOundational Representations for GEneration. TabFORGE is designed to utilise the implicitly learned causal information underlying diverse tabular datasets in a unified latent space induced by a pretrained causality-aware feature encoder. It further decouples latent modelling from decoding through a two-stage design: we first pretrain a score-based diffusion transformer, and then pretrain a denoising-aligned decoder using the denoised latent embeddings. This design elegantly mitigates the distribution shifts in latent embeddings that typically arise between training and inference. We evaluate TabFORGE comprehensively against 22 benchmark methods on 45 real-world datasets. Our results show that TabFORGE effectively learns and leverages generalisable tabular representations, enabling efficient generation of high-quality synthetic tabular data, particularly with strong structural fidelity.

2605.09422 2026-05-12 cs.CL cs.CV

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

Jiafeng Liang, Zhihao Zhu, Zihan Zhang, Baoqi Ren, Shixin Jiang, Runxuan Liu, Tao Ren, Ming Liu, See-Kiong Ng, Bing Qin

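Affiliations * Harbin Institute of Technology; Pengcheng Laboratory; National University of Singapore; Peking University; Harvard University

AI summary Although large multimodal models (LMMs) perform well on video understanding, they tend to rely on textual priors during causal discovery, a deficit that is not yet well understood. This paper proposes ProCauEval, a perturbation-based evaluation protocol that systematically controls the visual and textual inputs to reveal the failure modes of models in causal reasoning. The study finds that mainstream LMMs perceive video content accurately but fail to make full use of it in causal reasoning, and that stronger post-training increases reliance on textual priors. The authors therefore propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework that pushes the model to rely on visual evidence rather than textual shortcuts. Experiments show that it improves visual engagement while preserving basic comprehension.

Comments 17 pages, 5 figures

Details
English abstract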

Although Large Multimodal Models (LMMs) have achieved strong performance on general video understanding, their susceptibility to textual prior shortcuts during causal discovery has been recognized as a critical deficit. The underlying mechanisms of this phenomenon remain incompletely understood, as existing benchmarks only measure response accuracy without revealing the sources and extent of the deficit. We introduce ProCauEval, a perturbation-based evaluation protocol that shifts from outcome assessment to mechanism diagnosis, probing causal discovery through five controlled configurations that systematically manipulate visual and textual modalities to decompose their respective contributions to model behavior and dissect the failure modes. Evaluating 17 mainstream LMMs, we find that models faithfully perceive video content yet systematically underexploit it during causal reasoning. We further observe that stronger post-training amplifies rather than mitigates textual prior reliance, and that higher baseline performance correlates with greater fragility under perturbation. To address these, we propose Anti-Distillation Policy Optimization (ADPO), a reinforcement learning framework built on negative teacher alignment, which augments GRPO by explicitly pushing the policy away from a prior-only counterfactual teacher induced by visual corruption. Specifically, ADPO maximizes the divergence between the policy distributions conditioned on the original and visually corrupted inputs, thereby forcing the model to ground its reasoning in visual evidence rather than textual shortcuts. Extensive experiments show that ADPO improves visual engagement without sacrificing fundamental comprehension, thus offering a preliminary step toward reliable causal discovery.
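
The divergence term described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the divergence, so KL divergence over a toy categorical distribution is assumed here, and the names `softmax`, `kl_divergence`, and `visual_engagement_bonus` are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def visual_engagement_bonus(logits_original, logits_corrupted):
    """Assumed reward term: how far the policy conditioned on the original
    video moves away from the policy conditioned on a corrupted video.
    A value near zero suggests the answer ignores the visual input."""
    p = softmax(logits_original)
    q = softmax(logits_corrupted)
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Same answer distribution with and without the video: no visual reliance.
    print(visual_engagement_bonus([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    # Answer distribution changes when the video is corrupted.
    print(visual_engagement_bonus([3.0, 0.0, 0.0], [0.0, 0.0, 3.0]))
```

In a GRPO-style setup, a term like this would presumably be added to the reward or objective; how ADPO weights it is not stated in the abstract.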

2605.09419 2026-05-12 cs.AI

From Passive Reuse to Active Reasoning: Grounding Large Language Models for Neuro-Symbolic Experience Replay

Yanan Xiao, Yixiang Tang, Zechen Feng, Lu Jiang, Minghao Yin, Pengyang Wang

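Affiliations * Affiliation (Institution 1)

AI summary This paper proposes Neuro-Symbolic Experience Replay (NSER), a framework that turns experience replay in reinforcement learning from a passive memory mechanism into a knowledge-construction engine capable of active reasoning. The method combines large language models (LLMs) with symbolic logic representations to induce behavioral rules from accumulated trajectories and convert them into differentiable logic expressions, which are then used to dynamically reweight the replay distribution. By letting abstract knowledge directly guide policy optimization, NSER achieves higher sample efficiency and faster convergence across several benchmark tasks.

Details
English abstract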

While experience replay is essential for data efficiency in reinforcement learning (RL), standard methods treat the replay buffer as a passive memory system, prioritizing samples based on numerical prediction errors rather than their semantic significance. This approach stands in contrast to human learning, which accelerates mastery by actively abstracting fragmented experiences into behavioral rules. To bridge this gap, we propose Neuro-Symbolic Experience Replay (NSER), a framework that transforms experience replay from a passive sample reuse mechanism into an active engine for knowledge construction. Specifically, NSER addresses the incompatibility between linguistic reasoning and numerical optimization through a novel neuro-symbolic grounding pipeline. It leverages Large Language Models (LLMs) in a zero-shot manner to induce candidate behavioral rules from accumulated trajectories, grounds these insights into differentiable first-order logic representations, and utilizes the resulting symbolic structures to dynamically reweight the replay distribution. By allowing abstract knowledge to directly shape policy optimization, NSER achieves consistently superior sample efficiency and convergence speed across reactive, rule-based, and procedural benchmarks.

2605.09418 2026-05-12 cs.CV cs.RO

MAG-VLAQ: Multi-modal Aerial-Ground Query Aggregation for Cross-View Place Recognition

Zhengyi Xu, Yuhang Ming, Zhihao Zhan, Hanyu Zhu, Javier Civera, Wanzeng Kong

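Affiliations * Hangzhou Dianzi University; TopXGun Robotics; University of Zaragoza

AI summary Cross-view place recognition is challenging in computer vision and robotics, especially given the large viewpoint, modality, and structural differences between ground observations and aerial references. This paper proposes the MAG-VLAQ framework, which fuses multi-modal features extracted by pre-trained foundation models and aligns ground and aerial images in a shared embedding space. Its core contribution is ODE-conditioned VLAQ, which dynamically adapts the query centers to the multi-modal information, preserving global retrieval prototypes while improving scene-specific matching. Experiments show that the method clearly outperforms existing methods on KITTI360-AG, reaching 61.1 Recall@1.

Comments 16 pages, 4 figures, 3 tables

Details
English abstract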

Multi-modal cross-view place recognition remains a fundamental challenge in computer vision and robotics due to the severe viewpoint, modality, and spatial-structure discrepancies between ground observations and aerial references. To address this challenge, we present MAG-VLAQ, a foundation-model-enhanced query aggregation framework for multi-modal aerial-ground cross-view place recognition. Specifically, our approach leverages pre-trained foundation models to extract dense visual tokens from both ground and aerial images, as well as expressive geometric tokens from ground LiDAR observations. These heterogeneous tokens are then projected into a shared embedding space for cross-modal alignment and fusion. As our main contribution, we propose ODE-conditioned VLAQ, which tightly couples neural ordinary differential equations (ODE)-based RGB-LiDAR fusion with vectors of locally aggregated queries (VLAQ). In this design, the VLAQ query centers are dynamically adapted according to the fused multi-modal state. This mechanism allows the final global descriptor to preserve globally learned retrieval prototypes while remaining responsive to scene-specific visual and geometric evidence, significantly improving aerial-ground matching. Extensive experiments on KITTI360-AG and nuScenes-AG validate the effectiveness of our proposed MAG-VLAQ. Notably, on KITTI360-AG, our MAG-VLAQ nearly doubles the state-of-the-art performance, achieving 61.1 Recall@1 in the satellite setting, compared with 34.5 from the closest competing approach.

2605.09417 2026-05-12 cs.CV

SAMOFT: Robust Multi-Object Tracking via Region and Flow

Yanchao Wang, Dawei Zhang, Chengzhuan Yang, Wei Liu, Minglu Li, Hua Wang, Zhonglong Zheng, Ming-Hsuan Yang

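Affiliations * School of Computer Science and Technology, Zhejiang Normal University; Institute for Sustainable Industries and Liveable Cities, College of Engineering and Science, Victoria University; School of Electrical Engineering and Computer Science, University of California at Merced

AI summary This paper proposes SAMOFT, a robust multi-object tracking method for the difficulties caused by object deformation, nonlinear motion, and occlusion in complex motion scenarios. It introduces a Pixel Motion Matching (PMM) module that combines the Segment Anything Model (SAM) with dense optical flow to improve Kalman filter-based motion prediction. It also designs a Centroid Distance Matching (CDM) module and a Distribution-Based Correction (DBC) module, which respectively improve robustness to low-confidence detections and dynamically correct trajectory states online. Experiments on multiple benchmarks show that SAMOFT consistently improves baseline trackers and is competitive with recent state-of-the-art methods.

Details
English abstract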

Multi-object tracking (MOT) is a fundamental task in computer vision that requires continuously tracking multiple targets while maintaining consistent identities across frames. However, most existing approaches primarily rely on instance-level object features for trajectory association, which often leads to degraded performance under challenging conditions such as object deformation, nonlinear motion, and occlusion. In this work, we propose SAMOFT, a robust tracker that leverages pixel-level cues to improve robustness under complex motion scenarios. Specifically, we introduce a Pixel Motion Matching (PMM) module that integrates the Segment Anything Model (SAM) with dense optical flow to refine Kalman filter-based motion prediction using instantaneous foreground pixel motion. To further enhance robustness under unreliable detections, we design a Centroid Distance Matching (CDM) module that performs flexible mask-based centroid matching for low-confidence or partially occluded observations. Moreover, a Distribution-Based Correction (DBC) module models long-tailed motion patterns in a training-free manner using historical optical flow statistics and dynamically corrects trajectory states online. We also incorporate a Cluster-Aware ReID (CA-ReID) strategy to improve the stability and discriminative power of trajectory appearance features. Extensive experiments on the DanceTrack and MOTChallenge benchmarks demonstrate that SAMOFT consistently improves baseline trackers and achieves competitive performance compared with recent state-of-the-art methods, validating the effectiveness of leveraging pixel-level cues for robust multi-object tracking.

2605.09416 2026-05-12 cs.LG

A Controlled Diagnostic Study of Hardware-Induced Distortions in Hardware-Aware Training

Yunxuan Fang, Xinhe Wang

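Affiliations * Beihang University

AI summary This paper studies how hardware non-idealities affect neural network training and proposes a diagnostic framework that models hardware-induced distortions as structured perturbations of the forward operator and evaluates their compatibility with gradient-based optimization. Analyzing six representative perturbation classes, it identifies three key diagnostics and shows which hardware distortions can be compensated by training and which break optimization, offering guidance for hardware-software co-design.

Details
English abstract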

Hardware-aware training (HAT) is widely used to improve the robustness of neural networks on non-ideal AI accelerators, such as analog in-memory computing (IMC) systems. However, not all hardware-induced distortions are equally compensable by training. This paper presents a diagnostic framework that models hardware non-idealities as structured perturbations of the forward operator and evaluates their compatibility with gradient-based optimization. We analyze six representative perturbation classes--read noise, variability, drift, stuck-at faults, IR-drop, and ADC discretization--and identify three key diagnostics: gradient expectation consistency, bounded gradient variance, and non-degenerate sensitivity. Our results show a clear separation between perturbations that can be compensated by HAT and those that consistently break optimization. This provides practical guidance for hardware-software co-design, clarifying which non-idealities can be addressed at the training level and which require circuit-, architecture-, or calibration-level mitigation. This study should be interpreted as a controlled empirical analysis under vanilla forward-perturbation HAT, rather than as a universal theory of hardware-aware training.

2605.09414 2026-05-12 cs.CL

Cross-Cultural Transfer of Emoji Semantics and Sentiment in Financial Social Media

Ahmed Mahrous, Roberto Di Pietro

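Affiliations * King Abdullah University of Science and Technology

AI summary This study examines whether the semantics and sentiment of emojis in financial social media transfer across languages, platforms, and asset communities. Analyzing multilingual Twitter and StockTwits data, it finds that although emoji usage frequencies differ across communities, emoji semantics and sentiment polarity are largely stable. It also shows that including emoji information improves sentiment transfer relative to text-only models, while cross-language transfer remains the hardest setting. The results point to a partially shared "emoji code" in financial communication.

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026

Details
English abstract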

Emojis are widely used in online financial communication, but it is unclear whether they provide transferable sentiment signals across languages, platforms, and asset communities. This study examines the extent to which emoji usage, semantics, and sentiment polarity remain stable across financial communities, and how these layers influence zero-shot sentiment transfer. Using large corpora of Twitter and StockTwits posts in four languages, we measure cross-community divergence and evaluate sentiment models trained under emoji-only, text-only, and text+emoji inputs. We find that emoji frequencies differ across communities, especially across languages, but their semantics and sentiment polarity are largely stable. Cross-asset transferability shows minimal degradation, while cross-language transfer remains the most challenging. Including emojis consistently reduces transfer gaps relative to text-only models. These results indicate that financial communication exhibits a partially shared "emoji code," and that emojis provide compact, language-independent sentiment cues that improve model generalization across markets and platforms.

2605.09410 2026-05-12 cs.RO cs.AI

RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models

Weijia Liufu, Xiaoyu Guo, Ruiyi Chen, Jingzhi Liu, Kaidong Zhang, Xiwen Liang, Jianqi Lin, Dawei Sun, Yuze Wang, Rongtao Xu, Bingqian Lin, Bowen Yang, Tongtong Cao, Bowen Peng, Dongyu Zhang, Guangrun Wang, Min Wang, Liang Lin, Xiaodan Liang

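Affiliations * Sun Yat-sen University; South China University of Technology; Peng Cheng Laboratory; Harbin Institute of Technology; Institute of Automation, Chinese Academy of Sciences; Huawei Noah’s Ark Lab

AI summary This paper proposes RePO-VLA, a recovery-driven policy optimization framework for vision-language-action (VLA) models that aims to improve robustness in complex manipulation tasks. By assigning distinct roles to success, recovery, and failure trajectories and combining recovery-aware initialization with a progress-aware semantic value function, the method makes use of the useful information in failure data for policy optimization. Experiments on simulated and real-world bimanual tasks show improved success under adversarial conditions, from 20% to 75% on average and up to 80% in scaled real-world trials.

Details
English abstract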

Vision-Language-Action (VLA) models remain brittle in long-horizon, contact-rich manipulation because success-only imitation provides little supervision for execution drift, while failed rollouts are often discarded. We introduce RePO-VLA, a recovery-driven policy optimization framework that assigns distinct roles to success, recovery, and failure trajectories. RePO-VLA first applies Recovery-Aware Initialization (RAI), slicing recovery segments and resetting history so corrective actions depend on the current adverse state rather than the preceding failure. It then learns a Progress-Aware Semantic Value Function (PAS-VF), aligning spatiotemporal trajectory features with instructions and successful references. The resulting labels salvage useful failure prefixes via reliability decay, while low-value labels mark drift and terminal breakdowns, teaching differences among nominal, failed, and corrective actions. The data engine turns adverse states into planner-generated or human-collected corrective rollouts, teaching recovery to the success manifold. Value-Conditioned Refinement (VCR) trains the policy to prefer high-progress actions. At deployment, a fixed high value ($v=1.0$) biases actions toward the learned success manifold without online failure detectors or heuristic retries. We introduce FRBench, with standardized error injection and recovery-focused evaluation. Across simulated and real-world bimanual tasks, RePO-VLA improves robustness, raising adversarial success from 20% to 75% on average and up to 80% in scaled real-world trials.

2605.09408 2026-05-12 cs.LG cs.SI stat.ML

GravityGraphSAGE: Link Prediction in Directed Attributed Graphs

Riccardo Porcedda, Francesca Chiaromonte, Fabrizio Lillo, Andrea Vandin

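Affiliations * Department of Excellence L’EMbeDS, Sant’Anna School of Advanced Studies; Department of Computer Science, University of Pisa; Department of Statistics and Huck Institutes of the Life Sciences, The Pennsylvania State University; Class of Science, Scuola Normale Superiore; DTU Technical University of Denmark

AI summary This paper studies link prediction in directed attributed graphs, that is, predicting missing or future connections between nodes. To address the limitations of existing methods in handling directed graphs and node attributes, the authors propose GravityGraphSAGE (GG-SAGE), a modified GraphSAGE model with a gravity-inspired mechanism, which they describe as the first use of GraphSAGE for directed link prediction. Experiments show that the model outperforms state-of-the-art graph deep learning link prediction methods on several benchmark datasets and real-world networks, indicating its effectiveness and scalability on complex graph structures.

Details
English abstract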

Link prediction (inferring missing or future connections between nodes in a graph) is a fundamental problem in network science with widespread applications in, e.g., biological systems, recommender systems, finance and cybersecurity. The ability to accurately predict links has significant real-world applications, such as detecting fraudulent financial transactions or identifying drug-target interactions in biomedicine. Despite a rich literature, link prediction is still challenging, especially for graphs enriched with information on edges (direction) and nodes (attributes). In fact, research on link prediction, especially work based on Graph Deep Learning (GDL), has mostly focused on undirected graphs, without fully leveraging node attributes. Here, we fill this gap by proposing Gravity-GraphSAGE (GG-SAGE), a modified version of GraphSAGE (a GDL model for node embeddings) equipped with a gravity-inspired decoder. This implementation is the first example in the literature of a GraphSAGE backbone adopted for directed link prediction. Using the benchmark datasets Cora, Citeseer, PubMed and 16 real-world graphs from the online Netzschleuder repository, we show that our proposed model outperforms state-of-the-art GDL link prediction techniques. Using further experimental evidence, we relate the quality of the output of our model with various characteristics of the graph, suggesting that our framework scales well when applied to data of increasing complexity.
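
To make the idea of a gravity-inspired decoder concrete, here is a minimal sketch. The abstract does not give the decoder's formula, so this assumes the form commonly used in earlier gravity-inspired graph autoencoder work, where the score of a directed edge i -> j is sigmoid(m_j - lambda * log ||z_i - z_j||^2). The function names, the `lam` parameter, and the toy embeddings are illustrative assumptions, not GG-SAGE itself.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gravity_score(z_src, z_dst, mass_dst, lam=1.0, eps=1e-9):
    """Assumed gravity-style score for the directed edge src -> dst.

    z_src, z_dst: node embeddings (e.g. produced by a GraphSAGE encoder).
    mass_dst:     learned scalar "mass" of the destination node.
    lam:          weight of the distance term (hypothetical hyperparameter).
    The score is asymmetric because only the destination's mass is used.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(z_src, z_dst))
    return sigmoid(mass_dst - lam * math.log(sq_dist + eps))


if __name__ == "__main__":
    z_i, m_i = [0.0, 0.0], -1.0  # low-mass node
    z_j, m_j = [1.0, 0.0], 2.0   # high-mass node
    print("i -> j:", gravity_score(z_i, z_j, m_j))
    print("j -> i:", gravity_score(z_j, z_i, m_i))
```

Because the distance term is symmetric, the direction of a predicted link comes entirely from the mass of the target node: in this toy case the edge toward the high-mass node scores higher than the reverse edge.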

2605.09407 2026-05-12 cs.CV

AnyDepth-DETR/-YOLO: Any-depth object detection with a single network

Woochul Kang, Hyungseop Lee, Jiho Lee

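Affiliations * Incheon Nat’l Univ.

AI summary This paper proposes AnyDepth-DETR/-YOLO, an any-depth object detection framework in which a single network provides a continuous accuracy-efficiency trade-off by controlling depth at inference time, without retraining. The method divides the backbone and neck modules into an essential path that always executes and a skippable refinement path, preserving the multi-scale feature hierarchy at every depth configuration. Self-distillation between the deepest and shallowest networks, combined with prediction-level and feature-level alignment losses, keeps the outputs of each stage compatible. Experiments on RT-DETR and YOLOv12 show performance matching or exceeding existing SOTA baselines, with the most efficient configurations reaching up to 1.82x speedup at a cost of only 2.0 AP.

Comments 16 pages, 5 figures, 9 tables

Details
English abstract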

Modern object detectors are static, fixed-depth networks optimized for a single operating point, requiring separate models for different deployment scenarios. We present an any-depth detection framework that enables a single network to span a continuous range of accuracy--efficiency trade-offs by controlling depth at inference time without retraining. Each backbone and neck stage is divided into an essential path, which always executes, and a skippable refinement path; this decomposition preserves the full multi-scale feature hierarchy at every depth configuration, unlike conventional early exiting that discards entire stages. To train such a network, jointly optimizing many sub-networks of varying depth introduces conflicting gradient signals. We address this via self-distillation between only the two extremes, with prediction-level and feature-level alignment losses that enforce stage-wise modularity, ensuring the outputs of each stage remain compatible regardless of the paths taken. Instantiated on RT-DETR and YOLOv12, our full-depth configurations match or surpass their respective SOTA baselines with negligible parameter overhead, while the most efficient configurations achieve up to $1.82\times$ speedup at a cost of only 2.0 AP, all from a single set of weights.
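
The essential/skippable decomposition can be sketched in a few lines. This is a toy illustration on scalars rather than feature maps; the `Stage` class, the residual way the refinement is added, and the `refine_mask` argument are assumptions made for the example, not details given in the abstract.

```python
from typing import Callable, Sequence

Layer = Callable[[float], float]


class Stage:
    """One backbone or neck stage with an always-on path and an optional one."""

    def __init__(self, essential: Layer, refinement: Layer):
        self.essential = essential
        self.refinement = refinement

    def __call__(self, x: float, refine: bool) -> float:
        y = self.essential(x)  # always executes, so every stage still emits features
        if refine:
            # Assumed residual refinement: skipping it leaves y usable as-is.
            y = y + self.refinement(y)
        return y


def forward(stages: Sequence[Stage], x: float, refine_mask: Sequence[bool]) -> float:
    """Run all stages; refine_mask selects the depth configuration at inference."""
    if len(stages) != len(refine_mask):
        raise ValueError("refine_mask needs one flag per stage")
    for stage, refine in zip(stages, refine_mask):
        x = stage(x, refine)
    return x


if __name__ == "__main__":
    stages = [Stage(lambda v: 2 * v, lambda v: 1.0) for _ in range(2)]
    print(forward(stages, 1.0, [False, False]))  # shallowest configuration
    print(forward(stages, 1.0, [True, True]))    # full-depth configuration
```

Unlike early exiting, no stage is dropped in any configuration; only the refinement work inside a stage is skipped, which is what keeps the multi-scale hierarchy intact.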

2605.09404 2026-05-12 cs.LG cs.CL cs.CV

Let the Target Select for Itself: Data Selection via Target-Aligned Paths

Huitao Yang, Hengzhi He, Guang Cheng

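Affiliations * University of California, Los Angeles

AI summary This study addresses targeted data selection and proposes a new reference path to reduce the bias that conventional methods can incur on heterogeneous data pools. A short warmup on the target validation set produces a validation-induced reference path, and the endpoint loss drop along this path is used to score candidate samples, giving a selection rule that needs no gradient or Hessian approximations. Across several experiments the method performs comparably to dynamic attribution methods while substantially reducing warmup and storage cost, and it can be reused across different data pools.

Details
English abstract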

Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
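
A zero-order score of this kind can be sketched with a toy one-dimensional logistic model. The abstract does not specify the normalization, so dividing by the candidate's loss at the start of the warmup path is an assumption, as are the function names and the two hand-picked checkpoints `start` and `end`.

```python
import math


def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a 1-D logistic model on one example, y in {0, 1}."""
    z = w * x + b
    return math.log1p(math.exp(z)) - y * z


def endpoint_loss_drop(start, end, x, y, eps=1e-12):
    """Assumed score: relative loss decrease of a candidate between the first
    and last checkpoint of a warmup run on the target validation proxy.
    Only forward passes are needed, no candidate gradients."""
    loss_start = logistic_loss(start[0], start[1], x, y)
    loss_end = logistic_loss(end[0], end[1], x, y)
    return (loss_start - loss_end) / (loss_start + eps)


def select_top_k(candidates, start, end, k):
    """Keep the k candidates whose loss dropped most along the reference path."""
    ranked = sorted(
        candidates,
        key=lambda c: endpoint_loss_drop(start, end, c[0], c[1]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    start, end = (0.0, 0.0), (2.0, 0.0)  # (w, b) before and after warmup
    pool = [(1.0, 0), (1.0, 1), (-1.0, 1)]
    print(select_top_k(pool, start, end, k=1))
```

Since the two checkpoints depend only on the validation proxy, the same pair can score any number of candidate pools, which matches the reuse property the abstract describes.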

2605.09400 2026-05-12 cs.LG

D2ACE: Multi-Label Batch Selection Guided by Dual Dynamics and Adaptive Correlation Enhancement

Bin Liu, Haoyu Peng, Zhijia Wei, Jiajing Zhang, Grigorios Tsoumakas

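Affiliations * Key Laboratory of DECV, Chongqing University of Posts and Telecommunications; School of Computer Science and Technology, Chongqing University of Posts and Telecommunications; School of Informatics, Aristotle University of Thessaloniki

AI summary In deep multi-label classification, batch selection is important for training efficiency and predictive performance. Existing methods typically rely on a single metric to assess instance importance and use static label weights, overlooking how metric utility and label importance change during training. To address this, the paper proposes D2ACE, which combines dual dynamics with adaptive correlation enhancement: stage-wise Bernoulli mixture sampling and dynamic label weighting adjust label priorities during training, and a local context-aware correlation enhancement focuses on relevant labels. Experiments show stronger predictive performance and more efficient label-correlation modeling across a range of models and datasets.

Comments 18 pages

Details
English abstract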

Batch selection is crucial for improving both training efficiency and predictive performance in deep multi-label classification (MLC). Existing batch selection methods typically rely on a single metric to assess instance importance and use static label weights to distinguish label significance, neglecting the dynamic evolution of metric utility and label significance during training. In addition, the method that explicitly exploits label correlations is largely affected by abundant irrelevant labels and insensitive to local label distributions. To address these issues, we propose D2ACE, a novel multi-label batch selection method guided by Dual Dynamics and Adaptive Correlation Enhancement. D2ACE explicitly captures metric and label-level training dynamics by combining stage-wise Bernoulli mixture sampling, which balances uncertainty and noise-resistant hardness, with dynamic label weighting to recalibrate label priorities at each epoch based on current metric statistics. Furthermore, D2ACE introduces a local context-aware correlation enhancement to focus on relevant labels with instance-adaptive dependencies. Extensive experiments on tabular and image benchmarks demonstrate that D2ACE outperforms existing batch selection approaches across various deep MLC models, achieving stronger predictive performance and more efficient correlation modeling.

2605.09392 2026-05-12 cs.CV

HyNeuralMap: Hyperbolic Mapping of Visual Semantics to Neural Hierarchies

Zihan Ma, Tian Xia, Kexin Wang, Xiao Li, Xiaowei He, Yudan Ren

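Affiliations * School of Electronic Information (School of Artificial Intelligence), the Xi’an Key Laboratory of Radiomics and Intelligent Perception, Northwest University

AI summary This paper proposes HyNeuralMap, a framework that maps visual semantics into a cross-subject neural hierarchy, to help understand the complex mapping between visual stimuli and neural responses. The method uses the hyperbolic Lorentz model, with the negative curvature of hyperbolic space as an inductive bias, to better capture the hierarchical structure of visual semantics and cross-subject neural similarities. Experiments show that HyNeuralMap outperforms existing Euclidean methods on multi-label semantic prediction and cross-modal retrieval, supporting the advantages of hyperbolic geometry for cross-modal semantic alignment and hierarchical modeling.

Comments 14 pages, 4 figures

Details
English abstract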

Understanding the intricate mappings between visual stimuli and neural responses is a fundamental challenge in cognitive neuroscience. While current approaches predominantly align images and functional magnetic resonance imaging (fMRI) responses in Euclidean space, this geometry often struggles to preserve fine-grained semantic relationships and latent hierarchical structures across visual and neural modalities. To overcome this, we propose HyNeuralMap, a framework that employs the hyperbolic Lorentz model to map visual semantics into a shared, cross-subject neural hierarchy. By leveraging the negative curvature of hyperbolic space as an inductive bias, the proposed framework better captures hierarchical semantic organization and cross-subject neural similarities. Specifically, visual and neural embeddings are jointly optimized through hyperbolic geometric alignment, where geodesic distances preserve semantic proximity and hierarchical relationships more effectively than Euclidean embeddings. Experiments demonstrate that HyNeuralMap consistently outperforms state-of-the-art Euclidean baselines in both multi-label semantic prediction and cross-modal retrieval tasks. This confirms hyperbolic geometry's superiority for cross-modal semantic alignment and hierarchical modeling, providing a new avenue for vision-neural representation learning.

2605.09387 2026-05-12 cs.AI cs.RO

NEXUS: Continual Learning of Symbolic Constraints for Safe and Robust Embodied Planning

Tiehan Cui, Peipei Liu, Yanxu Mao, Congying Liu, Mingzhe Xing, Datao You

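Affiliations * School of Artificial Intelligence and Automation; Huazhong University of Science and Technology; School of Software; Henan University; Institute of Information Engineering; Chinese Academy of Sciences; School of Cyberspace Security; University of the Chinese Academy of Sciences; Peking University

AI summary This paper proposes NEXUS, a modular framework for learning symbolic constraints during the continual learning of embodied agents. The framework decouples physical feasibility from safety specifications and combines closed-loop execution feedback with probabilistic risk assessment to check instructions rigorously for safety and avoid risk before acting. Experiments show that NEXUS performs well in task success rate, refusal of unsafe instructions, and defense against adversarial attacks, and that planning efficiency improves progressively through knowledge accumulation.

Details
English abstract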

While Large Language Models (LLMs) have catalyzed progress in embodied intelligence, a fundamental gap remains between their inherent probabilistic uncertainty and the strict determinism and verifiable safety required in the physical world. To mitigate this gap, this paper introduces NEXUS, a modular framework designed for continual learning in embodied agents. Different from prior works that treat symbolic artifacts merely as static interfaces, NEXUS leverages them for symbolic grounding and knowledge evolution. The framework explicitly decouples physical feasibility from safety specifications: capability of agents is improved through closed-loop execution feedback, while probabilistic risk assessments are grounded into deterministic hard constraints to establish a rigorous pre-action defense. Experiments on SafeAgentBench demonstrate that NEXUS achieves superior task success rates while effectively refusing unsafe instructions, exhibiting robust defense against adversarial attacks, and progressively improving planning efficiency through knowledge accumulation.

2605.09384 2026-05-12 cs.CV cs.AI q-bio.QM

LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering

Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao

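Affiliations * School of Information Technology; Monash University Malaysia; Faculty of Innovation Engineering; Macau University of Science and Technology; Department of Bioelectronics; Faculty of Biomedical Engineering; Shenzhen University of Advanced Technology

AI summary This paper proposes LiteMedCoT-VL, a parameter-efficient adaptation method that aims to improve the reasoning ability of medical visual question answering (VQA) models on resource-constrained devices. Using LoRA-based fine-tuning, it transfers chain-of-thought reasoning from a large teacher model to small student models, and it does not rely on image captions, which is closer to real clinical settings. On the PMC-VQA benchmark LiteMedCoT-VL reaches 64.9% accuracy, well above existing baselines, indicating that small models with reasoning distillation can match or exceed larger ones.

Comments 17 pages, 5 figures

Details
English abstract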

The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2--4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.