arXivDaily: arXiv daily academic digest, updated Monday to Friday
2603.07686 2026-05-12 cs.RO cs.CV

UniUncer: Unified Dynamic Static Uncertainty for End to End Driving

Yu Gao, Jijun Wang, Zongzheng Zhang, Anqing Jiang, Yiru Wang, Yuwen Heng, Shuo Wang, Hao Sun, Zhangfeng Hu, Hao Zhao

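Affiliations Bosch Corporate Research

AI summary This paper proposes UniUncer, a unified dynamic-static uncertainty framework for end-to-end autonomous driving that aims to improve how the system perceives and responds to environmental uncertainty. It converts deterministic models into probabilistic regressors and adds an uncertainty-fusion module and an uncertainty-aware gate, jointly modeling and using the uncertainty of static map elements and dynamic traffic participants. Experiments show that UniUncer improves trajectory prediction and driving decisions on multiple benchmarks with minimal computational overhead.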

Comments Accepted ICRA 2026

Abstract

End-to-end (E2E) driving has become a cornerstone of both industry deployment and academic research, offering a single learnable pipeline that maps multi-sensor inputs to actions while avoiding hand-engineered modules. However, the reliability of such pipelines strongly depends on how well they handle uncertainty: sensors are noisy, semantics can be ambiguous, and interaction with other road users is inherently stochastic. Uncertainty also appears in multiple forms: classification vs. localization, and, crucially, in both static map elements and dynamic agents. Existing E2E approaches model only static-map uncertainty, leaving planning vulnerable to overconfident and unreliable inputs. We present UniUncer, the first lightweight, unified uncertainty framework that jointly estimates and uses uncertainty for both static and dynamic scene elements inside an E2E planner. Concretely: (1) we convert deterministic heads to probabilistic Laplace regressors that output per-vertex location and scale for vectorized static and dynamic entities; (2) we introduce an uncertainty-fusion module that encodes these parameters and injects them into object/map queries to form uncertainty-aware queries; and (3) we design an uncertainty-aware gate that adaptively modulates reliance on historical inputs (ego status or temporal perception queries) based on current uncertainty levels. The design adds minimal overhead and drops throughput by only about 0.5 FPS while remaining plug-and-play for common E2E backbones. On nuScenes (open-loop), UniUncer reduces average L2 trajectory error by 7%. On NavsimV2 (pseudo closed-loop), it improves overall EPDMS by 10.8%, with notable stage-two gains in challenging, interaction-heavy scenes. Ablations confirm that dynamic-agent uncertainty and the uncertainty-aware gate are both necessary.
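
The Laplace regression head described in point (1) can be illustrated with a minimal sketch. It assumes the standard Laplace negative log-likelihood, log(2b) + |y - mu| / b, averaged over vertex coordinates; the function name `laplace_nll` and the toy numbers are hypothetical and are not taken from the paper.

```python
import math


def laplace_nll(targets, locations, scales, eps=1e-6):
    """Mean Laplace negative log-likelihood over a flat list of coordinates.

    Per coordinate: log(2 * b) + |y - mu| / b, where mu is the predicted
    location and b is the predicted scale (the uncertainty).
    """
    total = 0.0
    for y, mu, b in zip(targets, locations, scales):
        b = max(b, eps)  # keep the scale strictly positive
        total += math.log(2.0 * b) + abs(y - mu) / b
    return total / len(targets)


# Hypothetical coordinates of one vertex (illustrative values, not real data).
ground_truth = [1.0, 2.0]
predicted = [1.5, 2.0]  # the first coordinate is off by 0.5

# Same prediction error, scored under a small and a larger predicted scale.
confident = laplace_nll(ground_truth, predicted, scales=[0.1, 0.1])
uncertain = laplace_nll(ground_truth, predicted, scales=[0.5, 0.5])
```

In this toy case the loss is lower when the head reports a larger scale for the inaccurate prediction, which is the mechanism that lets a scale output act as an uncertainty estimate.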

2603.02678 2026-05-12 cs.LG cs.ET cs.HC stat.ME stat.ML

Causal Discovery Should Embrace the Wisdom of the Crowd

Ryan Feng Lin, Yuantao Wei, Huiling Liao, Xiaoning Qian, Shuai Huang

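Affiliations Department of Industrial and Systems Engineering, University of Washington; Department of Applied Mathematics, Illinois Institute of Technology; Department of Electrical & Computer Engineering, Texas A&M University

AI summary This paper proposes a new causal-learning paradigm based on the "wisdom of the crowd", arguing that a global causal structure can be built by integrating distributed and possibly noisy causal knowledge from many contributors. Drawing on crowdsourcing platforms, expert knowledge elicitation and aggregation techniques, and LLM-augmented information acquisition, it outlines a framework for crowd-based causal learning spanning elicitation, modeling, aggregation, and optimization. The paradigm opens a new research direction for causal learning and brings both opportunities and challenges for interdisciplinary collaboration.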

Abstract

This paper argues for recognizing an emerging paradigm of causal learning by wisdom of the crowd. Recent developments in government, industry, and research point to the rise of decentralized and crowd-based approaches within causal modeling, where causal knowledge distributed across many contributors can be systematically elicited and integrated with causal learning workflows. In this paradigm, causal learning becomes a distributed decision-making problem: each participant contributes partial and potentially noisy knowledge, while collective contributions help construct a global causal structure. This direction is enabled by advances in crowdsourcing platforms, expert knowledge elicitation, aggregation techniques, and large language model (LLM)-augmented information acquisition. Its promise is increasingly visible in early research and emerging real-world practices. Building on this momentum, we outline a framework for crowd-based causal learning spanning elicitation, modeling, aggregation, and optimization. We further discuss the opportunities and challenges introduced by this paradigm and call for interdisciplinary collaboration across causal learning, collective intelligence, human-AI interaction, and decision science.

2603.00918 2026-05-12 cs.CV cs.AI

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

Seungwook Kim, Minsu Cho

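Affiliations POSTECH; RLWRLD; GenGenAI

AI summary This paper proposes SOLACE, a post-training framework for improving text-to-image generation quality. The model re-noises its own generated images and measures how accurately it recovers the injected noise, producing an intrinsic self-confidence signal that serves as the reinforcement learning reward, with no external reward model or human annotation. Experiments show consistent gains in compositional generation, text rendering, and text-image alignment, and complementary improvements when SOLACE is combined with external rewards.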

Comments 22 pages, accepted to CVPR 2026. Project page https://wookiekim.github.io/SOLACE/

Abstract

Text-to-image generation powers content creation across design, media, and data augmentation. Post-training of text-to-image generative models is a promising path to improve human preference alignment, factuality, and aesthetics. We introduce SOLACE (Self-Originating LAtent Confidence Estimation), a post-training framework that replaces external reward supervision with an internal self-confidence signal: we re-noise the model's own outputs and measure how accurately it recovers the injected noise, treating low reconstruction error as high self-confidence. SOLACE converts this intrinsic signal into scalar rewards for reinforcement learning, requiring no external reward models, annotators, or preference data. By reinforcing high-confidence generations, SOLACE delivers consistent gains in compositional generation, text rendering, and text-image alignment. Integrating SOLACE with external rewards yields complementary improvements while alleviating reward hacking.
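
The self-confidence signal described above can be sketched in a few lines. This is only an illustration under stated assumptions: the variance-preserving mixing rule, the use of negative mean squared error as the reward, and the names `self_confidence_reward`, `oracle`, and `blind` are all hypothetical and are not taken from the paper.

```python
import math
import random


def self_confidence_reward(latent, predict_noise, noise_level=0.5, seed=0):
    """Re-noise a generated latent and score how well the noise is recovered.

    Returns the negative mean squared error between the injected and the
    predicted noise, so a lower reconstruction error gives a higher reward.
    """
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    keep = math.sqrt(1.0 - noise_level)  # assumed mixing rule
    mix = math.sqrt(noise_level)
    noisy = [keep * x + mix * e for x, e in zip(latent, noise)]
    predicted = predict_noise(noisy, noise_level)
    mse = sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(noise)
    return -mse


latent = [0.2, -0.4, 0.9, 0.1]  # hypothetical generated sample


def oracle(noisy, noise_level):
    """Stand-in predictor that knows the clean latent (recovers the noise)."""
    keep = math.sqrt(1.0 - noise_level)
    mix = math.sqrt(noise_level)
    return [(n - keep * x) / mix for n, x in zip(noisy, latent)]


def blind(noisy, noise_level):
    """Stand-in predictor that always guesses zero noise."""
    return [0.0 for _ in noisy]


high = self_confidence_reward(latent, oracle)
low = self_confidence_reward(latent, blind)
```

The two stand-in predictors bracket the range: accurate noise recovery yields a reward near zero, and poor recovery yields a clearly negative one.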

2603.00166 2026-05-12 cs.CV cs.AI

Exploring the AI Obedience: Why is Generating a Pure Color Image Harder than CyberPunk?

Hongyu Li, Kuan Liu, Yuan Chen, Juntao Hu, Huimin Lu, Guanjie Chen, Xue Liu, Guangming Lu, Hong Huang

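Affiliations Harbin Institute of Technology, Shenzhen; City University of Hong Kong; Chinese University of Hong Kong; McGill University

AI summary This paper examines the "Paradox of Simplicity" in generative AI: models that excel at complex scenes struggle with simple tasks such as generating a pure color image. It introduces the concept of AI Obedience, builds a hierarchical evaluation framework, and presents Violin, the first systematic benchmark for assessing a model's ability to move from probabilistic approximation to pixel-level determinism. Experiments show that closed-source models outperform open-source ones on deterministic tasks and that performance correlates with natural image generation ability, providing a foundational framework and tools for understanding instruction alignment.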

Abstract

Recent advances in generative AI have shown human-level performance in complex content creation. However, we identify a "Paradox of Simplicity": models that can render complex scenes often fail at trivial, low-entropy tasks, such as generating a uniform pure color image. We argue this is a systemic failure related to uncontrollable emergent abilities. As models scale, strong priors for aesthetics and complexity override deterministic simplicity, creating an "aesthetic bias" that hinders the model's transition from data simulation to true intellectual abstraction. To better investigate this problem, we formalize the concept of AI Obedience, a hierarchical framework that grades a model's ability to transition from probabilistic approximation to pixel-level determinism (Levels 1 to 5). We introduce Violin, the first systematic benchmark designed to evaluate Level 4 Obedience through three deterministic tasks: color purity, image masking, and geometric shape generation. Using Violin, we evaluate several state-of-the-art models and reveal that closed-source models generally outperform open-source ones in deterministic precision. Interestingly, performance on our benchmark correlates with the benchmark in natural image generation. Our work provides a foundational framework and tools for achieving better alignment between human instructions and model outputs.

2602.22508 2026-05-12 cs.AI

Metacognitive Behavioral Tuning of Large Language Models for Multi-Hop Question Answering

Ik-hwan Kim, Hyeongrok Han, Mingi Jung, Sangwon Yu, Jinseok Hong, Sang Hun Kim, Yoonyoung Choi, Sungroh Yoon

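Affiliations Dept. of ECE Seoul Nat’l Univ.; AI Center Samsung Elec. Korea; AIIS, ASRI, INMC, ISRC, Interdisc. Prog. in AI Seoul Nat’l Univ.

AI summary This paper studies why large language models give wrong answers on multi-hop question answering even when a correct intermediate conclusion is already present, and attributes the problem to weak self-regulation. It proposes Metacognitive Behavioral Tuning (MBT), a post-training framework that injects a five-phase metacognitive structure to strengthen self-regulation during reasoning. Experiments show that MBT attains the highest Accuracy-Efficiency Score on multiple benchmarks while substantially shortening reasoning traces and reducing redundancy, confirming the value of its structural prior.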

Comments 41 pages

Abstract

Large Language Models (LLMs) often produce incorrect answers on multi-hop question answering even when the reasoning trace already contains a correct intermediate conclusion. We attribute this gap to weak self-regulation rather than insufficient reasoning capacity. Without explicit regulation, valid intermediate conclusions are overridden by continued exploration or left unrecognized as logically sufficient. We propose Metacognitive Behavioral Tuning (MBT), a post-training framework that injects a five-phase metacognitive structure into reasoning traces. The five phases are understanding and filtering, planning, execution and monitoring, self-correction, and verification. MBT has two formulations. MBT-S synthesizes new metacognitive traces from scratch, while MBT-R rewrites the student's own traces into a metacognitive form. Across HotpotQA, MuSiQue, and 2WikiMultiHopQA, MBT attains the highest Accuracy-Efficiency Score (AES) across model scales. MBT lifts task accuracy while keeping traces short and stable, with mean response length on MuSiQue an order of magnitude shorter than baseline methods and degeneration counts reduced by a similar margin. A matched-control study further confirms that the gain stems from the five-phase structural prior itself. To qualitatively assess the regulatory behavior of reasoning traces, we introduce two new metrics, the Reach-Redundancy Profile (RRP) and the length-aware Metacognitive Quality Index (MQI). RRP captures when the answer is reached and how much of the trace is redundant, and MQI quantifies how richly the five phases appear. Under both metrics, MBT achieves the earliest answer arrival, the lowest redundancy, and the richest phase-level behavior across model scales.

2602.21581 2026-05-12 cs.CV

MultiAnimate: Pose-Guided Image Animation Made Extensible

Yingcheng Hu, Haowen Gong, Chuanguang Yang, Zhulin An, Yongjun Xu, Songhua Liu

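Affiliations State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; ShanghaiTech University; Shanghai Jiao Tong University

AI summary This paper proposes MultiAnimate, an extensible multi-character image animation framework that addresses identity confusion and implausible occlusions in pose-guided multi-character video generation. Built on modern Diffusion Transformers (DiTs), it introduces two key components, Identifier Assigner and Identifier Adapter, which capture per-person positional cues and inter-person spatial relationships, improving flexibility and generalization. Experiments show state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.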

Comments CVPR2026 Accepted. Project page at https://hyc001.github.io/MultiAnimate/

Abstract

Pose-guided human image animation aims to synthesize realistic videos of a reference character driven by a sequence of poses. While diffusion-based methods have achieved remarkable success, most existing approaches are limited to single-character animation. We observe that naively extending these methods to multi-character scenarios often leads to identity confusion and implausible occlusions between characters. To address these challenges, in this paper, we propose an extensible multi-character image animation framework built upon modern Diffusion Transformers (DiTs) for video generation. At its core, our framework introduces two novel components, Identifier Assigner and Identifier Adapter, which collaboratively capture per-person positional cues and inter-person spatial relationships. This mask-driven scheme, along with a scalable training strategy, not only enhances flexibility but also enables generalization to scenarios with more characters than those seen during training. Remarkably, trained on only a two-character dataset, our model generalizes to multi-character animation while maintaining compatibility with single-character cases. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in multi-character image animation, surpassing existing diffusion-based baselines.

2602.17283 2026-05-12 cs.CL cs.AI

Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective

Yukun Chen, Xinyu Zhang, Boyi Deng, Jialong Tang, Yu Wan, Fei Huang, Yuxi Zhou, Baosong Yang, Yiming Li

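Affiliations State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Tongyi Lab, Alibaba Group; Nanyang Technological University

AI summary As large language models are deployed worldwide, existing evaluations of their multilingual capabilities focus mainly on factual task performance and overlook the ability to judge the deep-level values of content across languages. Starting from two core challenges, cultural diversity and disciplinary complexity, the paper proposes a two-stage human-AI collaborative annotation framework and builds X-Value, the first cross-lingual values judgment benchmark, with 4,750 question-answer pairs in 14 languages and 12 fine-grained annotation metadata fields. A systematic evaluation of 17 LLMs reveals performance disparities across categories and languages and highlights the urgent need to improve values judgment.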

Abstract

As large language models (LLMs) are employed worldwide, existing evaluation paradigms for their multilingual capabilities primarily focus on factual task performance, neglecting the ability to judge content's deep-level values across multiple languages. To bridge this gap, we first reveal two primary challenges in constructing values judgment benchmarks, cultural diversity and disciplinary complexity, and propose a novel two-stage human-AI collaborative annotation framework to alleviate them. This framework identifies the issue scope and nature, establishes specific annotation criteria, and utilizes multiple LLMs for final review. Building upon this framework, we introduce X-Value, the first Cross-lingual Values Judgment Benchmark designed to evaluate the capability of LLMs in judging deep-level values of content. X-Value comprises 4,750 Question-Answer pairs across 14 languages, covering 7 major global issue categories, and provides 12 granular annotation metadata to facilitate a rigorous evaluation of model performance. Systematic evaluations of X-Value are conducted across 17 LLMs using distinct prompting strategies. Multi-dimensional analysis of accuracy and F1-scores reveals their limitations in cross-lingual values judgment and indicates performance disparities across categories and languages. This work highlights the urgent need to improve the underlying, values-aware content judgment capability of LLMs. Samples of X-Value are available at https://huggingface.co/datasets/Whitolf/X-Value.

2602.11824 2026-05-12 cs.AI cs.LG

Revis: Sparse Latent Steering to Mitigate Object Hallucination in Large Vision-Language Models

Jialin Wu, Wei Shi, Han Shen, Peigui Qi, Kunsheng Tang, Zhicong Huang, Binghao Wang, Zhou Yang

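Affiliations Ant Group

AI summary Although Large Vision-Language Models (LVLMs) are highly capable, they frequently suffer from object hallucination. To address this, the paper proposes REVIS, a training-free framework that uses latent space geometry to extract pure visual information and applies sparse intervention at the specific network depth where suppression occurs, restoring visual information at low computational cost. Experiments show that REVIS reduces object hallucination rates by about 19% on standard benchmarks while preserving general reasoning capabilities.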

Comments Accepted by ICML 2026

Abstract

Despite the advanced capabilities of Large Vision-Language Models (LVLMs), they frequently suffer from object hallucination. One reason is that visual features and pretrained textual representations often become intertwined in the deeper network layers. To address this, we propose REVIS, a training-free framework designed to explicitly re-activate this suppressed visual information. Rooted in latent space geometry, REVIS extracts the pure visual information vector via orthogonal projection and employs a calibrated strategy to perform sparse intervention only at the precise depth where suppression occurs. This surgical approach effectively restores visual information with minimal computational cost. Empirical evaluations on standard benchmarks demonstrate that REVIS reduces object hallucination rates by approximately 19% compared to state-of-the-art baselines, while preserving general reasoning capabilities.
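
The orthogonal projection step mentioned above can be illustrated with a small sketch. It assumes the "pure" visual vector is what remains of a visual feature after removing its component along a textual direction, and that steering means adding a scaled copy of that vector to a hidden state; which vectors REVIS actually projects, and the names `remove_component` and `steer`, are assumptions made here for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def remove_component(vector, direction):
    """Return the part of `vector` orthogonal to a nonzero `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vector, direction)]


def steer(hidden, visual_direction, strength):
    """Add a scaled steering vector to one hidden state (one layer only)."""
    return [h + strength * v for h, v in zip(hidden, visual_direction)]


# Toy 3-dimensional vectors, purely illustrative.
visual = [1.0, 1.0, 0.0]
text_prior = [1.0, 0.0, 0.0]

pure_visual = remove_component(visual, text_prior)  # [0.0, 1.0, 0.0]
steered = steer([0.5, 0.5, 0.5], pure_visual, strength=2.0)  # [0.5, 2.5, 0.5]
```

Applying `steer` at a single chosen layer rather than at every layer corresponds to the sparse intervention the abstract describes; how that layer is selected is not modeled here.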

2602.11181 2026-05-12 cs.CL

Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs

Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney

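Affiliations Arizona State University; Carnegie Mellon University

AI summary This paper examines the challenges that code-mixing and code-switching pose for large language models (LLMs), noting that despite progress in multilingual modeling, models still degrade systematically in grammaticality, factuality, and safety in mixed-language settings. It proposes a unifying taxonomy spanning data, modeling, and evaluation, and distills a practical playbook for building and evaluating code-mixing-capable LLMs. It also analyzes the shortcomings of current evaluation practice, points out the limitations of existing benchmarks, and discusses emerging safety concerns such as the use of code-mixing to bypass model safeguards.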

Comments 8 pages main paper, 13 pages total

Abstract

Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.

2602.10356 2026-05-12 cs.CL

Autonomous Continual Learning for Environment Adaptation of Computer-Use Agents

Tianci Xue, Zeyi Liao, Tianneng Shi, Zilu Wang, Kai Zhang, Dawn Song, Yu Su, Huan Sun

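Affiliations The Ohio State University; University of California, Berkeley

AI summary This paper studies how computer-use agents (CUAs) can continually learn and adapt in highly diverse and dynamic real-world digital environments, where the core challenge is obtaining high-quality training data without human annotation. The authors propose ACuRL, a framework that uses autonomous curriculum reinforcement learning to achieve continual environment adaptation with zero human data. Combining a curriculum task generator with the automatic evaluator CUAJudge, it improves both intra-environment and cross-environment learning and yields notable gains on multiple tasks.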

Comments 28 pages, 10 figures

Abstract

Real-world digital environments are highly diverse and dynamic. These characteristics cause agents to frequently encounter unseen environments and distribution shifts, making continual learning in such environments essential for computer-use agents (CUAs). However, a key challenge lies in obtaining high-quality and environment-grounded training data without relying on costly human annotation. In this work, we introduce ACuRL, an Autonomous Curriculum Reinforcement Learning framework that continually adapts agents to specific environments with zero human data. The agent first explores an environment to acquire initial experiences. During subsequent iterative training, a curriculum task generator leverages these experiences together with feedback from the previous iteration to synthesize new tasks tailored for the agent's current capabilities. To provide reliable reward signals, we introduce CUAJudge, a robust automatic evaluator for CUAs that achieves 93% agreement with human judgments. Empirically, our method effectively enables both intra-environment and cross-environment continual learning, yielding 3-29% absolute performance gains on the target environments without catastrophic forgetting on others. We also show that it can mitigate performance degradation under environment changes (e.g., version updates, platform migration, and resolution shifts). Further analyses show highly sparse updates (e.g., only 20% parameters), which helps explain the effective and robust adaptation.

2602.09534 2026-05-12 cs.CV

AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua

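Affiliations University of the Chinese Academy of Sciences; National University of Singapore; Zhejiang University; State Key Laboratory of Communication Content Cognition, People’s Daily Online

AI summary This paper proposes AUHead, a method for generating talking-head videos with realistic emotional expression. By disentangling fine-grained emotion control, in the form of Action Units (AUs), from audio, it enables precise control over emotional expression. The two-stage framework first uses a large audio-language model to generate AU sequences and then applies an AU-driven diffusion model to synthesize high-quality video, improving emotional realism and visual coherence.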

Comments https://openreview.net/forum?id=dmzlAUkulz Accepted at the 14th International Conference on Learning Representations (ICLR 2026)

Abstract

Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs), by spatial-temporal AU tokenization and an "emotion-then-AU" chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into the structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR

2602.09016 2026-05-12 cs.CV

Raster2Seq: Polygon Sequence Generation for Floorplan Reconstruction

Hao Phung, Hadar Averbuch-Elor

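Affiliations Cornell University

AI summary This paper proposes Raster2Seq, a method for reconstructing a structured vector-graphics representation from a rasterized floorplan image. It treats floorplan reconstruction as a sequence-to-sequence task, representing elements such as rooms, windows, and doors as labeled polygon sequences that encode geometry and semantics. An autoregressive decoder guided by learnable anchors predicts the next corner from image features and previously generated corners, which makes it better suited to complex floorplans with diverse polygon structures. Experiments show state-of-the-art performance on several standard datasets and good generalization to more challenging ones.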

Comments Accepted to SIGGRAPH 2026. Project page: https://cornell-vailab.github.io/Raster2Seq/

Abstract

Reconstructing a structured vector-graphics representation from a rasterized floorplan image is typically an important prerequisite for computational tasks involving floorplans such as automated understanding or CAD workflows. However, existing techniques struggle to faithfully generate the structure and semantics conveyed by complex floorplans that depict large indoor spaces with many rooms and a varying number of polygon corners. To this end, we propose Raster2Seq, framing floorplan reconstruction as a sequence-to-sequence task in which floorplan elements, such as rooms, windows, and doors, are represented as labeled polygon sequences that jointly encode geometry and semantics. Our approach introduces an autoregressive decoder that learns to predict the next corner conditioned on image features and previously generated corners using guidance from learnable anchors. These anchors represent spatial coordinates in image space, hence allowing for effectively directing the attention mechanism to focus on informative image regions. By embracing the autoregressive mechanism, our method offers flexibility in the output format, enabling efficient handling of complex floorplans with numerous rooms and diverse polygon structures. Our method achieves state-of-the-art performance on standard benchmarks such as Structure3D, CubiCasa5K, and Raster2Graph, while also demonstrating strong generalization to more challenging datasets like WAFFLE, which contains diverse room structures and complex geometric variations.

2602.06733 2026-05-12 cs.LG cs.AI cs.MA

Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding

Rishabh Jain, Keisuke Okumura, Michael Amir, Pietro Lio, Amanda Prorok

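Affiliations University of Cambridge; National Institute of Advanced Industrial Science and Technology

AI summary Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem in which multiple agents must reach their respective goals without collisions. Existing methods based on Graph Neural Networks (GNNs) are typically limited to pairwise message passing and struggle to capture higher-order interactions among agents, so they perform poorly in dense environments. The paper proposes HMAGAT, a hypergraph attention network that explicitly models group dynamics through attention over directed hypergraphs, mitigating attention dilution and outperforming the current state of the art with less training data and fewer parameters.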

Comments Published at ICLR 2026

Abstract

Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to pairwise message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e., beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100× less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.

2602.06382 2026-05-12 cs.RO

Now You See That: Learning End-to-End Humanoid Locomotion from Raw Pixels

Wandong Sun, Yongbo Su, Leoric Huang, Alex Zhang, Dwyane Wei, Mu San, Daniel Tian, Ellie Cao, Baoshi Cao, Yang Liu, Finn Yan, Ethan Xie, Zongwu Xie

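Affiliations Harbin Institute of Technology; HONOR Robotics Team

AI summary This work addresses the sim-to-real transfer gap and the difficulty of adapting to complex terrain in vision-based humanoid locomotion. To cope with perception noise and conflicting learning objectives across terrains, the authors propose an end-to-end vision-driven framework that includes high-fidelity depth sensor simulation and vision-aware behavior distillation for real-world robustness, together with terrain-specific reward shaping and multi-critic learning for adaptation across terrains. Experiments on multiple humanoid platforms show strong generality and the ability to handle complex tasks.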

Abstract

Achieving robust vision-based humanoid locomotion remains challenging due to two fundamental issues: the sim-to-real gap introduces significant perception noise that degrades performance on fine-grained tasks, and training a unified policy across diverse terrains is hindered by conflicting learning objectives. To address these challenges, we present an end-to-end framework for vision-driven humanoid locomotion. For robust sim-to-real transfer, we develop a high-fidelity depth sensor simulation that captures stereo matching artifacts and calibration uncertainties inherent in real-world sensing. We further propose a vision-aware behavior distillation approach that combines latent space alignment with noise-invariant auxiliary tasks, enabling effective knowledge transfer from privileged height maps to noisy depth observations. For versatile terrain adaptation, we introduce terrain-specific reward shaping integrated with multi-critic and multi-discriminator learning, where dedicated networks capture the distinct dynamics and motion priors of each terrain type. We validate our approach on two humanoid platforms equipped with different stereo depth cameras. The resulting policy demonstrates robust performance across diverse environments, seamlessly handling extreme challenges such as high platforms and wide gaps, as well as fine-grained tasks including bidirectional long-term staircase traversal.

2602.05946 2026-05-12 cs.LG stat.ML

f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment

Rajdeep Haldar, Lantao Mei, Guang Lin, Yue Xing, Qifan Song

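Affiliations Department of Statistics, Purdue University; Department of Statistics, Michigan State University

AI summary This paper studies how divergence-based reinforcement learning algorithms can achieve general alignment of large language models, including settings such as reinforcement learning with verifiable rewards (RLVR). The authors propose $f$-GRPO, for on-policy reward optimization, and $f$-HAL, a hybrid alignment loss that combines on-policy optimization with preference supervision. They show that both estimate $f$-divergences between reward-aligned and reward-unaligned distributions, and experiments demonstrate advantages on math-reasoning tasks and safety alignment.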

Abstract

Recent work shows that preference alignment objectives can be interpreted as divergence estimators between aligned (preferred) & unaligned (less-preferred) distributions, yielding a principled recipe for designing alignment losses. However, this view has so far been limited to preference-based supervision. We extend it to general LLM alignment, including reinforcement learning with verifiable rewards (RLVR), where alignment feedback is given only as scalar rewards. We introduce $f$-Group Relative Policy Optimization ($f$-GRPO), a class of on-policy RL objectives, and $f$-Hybrid Alignment Loss ($f$-HAL), which combines on-policy reward optimization with off-policy preference supervision. We show that these objectives estimate $f$-divergences between reward-aligned & reward-unaligned distributions induced by above- & below-average reward responses, and prove expected reward improvement after alignment. Empirically, $f$-GRPO improves over GRPO on math-reasoning RLVR tasks, while hybrid $f$-HAL mitigates reward hacking in on-policy safety alignment when verifiable rewards are unavailable and learned reward models must be used.

2602.05243 2026-05-12 cs.LG cs.CV

CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers

Boxiang Zhang, Baijian Yang

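Affiliations Purdue University

AI summary This paper proposes CORP, a closed-form one-shot structured pruning method that needs no gradients or fine-tuning and removes MLP dimensions and attention substructures in Transformer models. It formulates structured pruning as a representation recovery problem and uses closed-form ridge regression to derive an analytic solution that folds compensation into the model weights, compressing the model efficiently while preserving accuracy. Experiments show that DeiT models retain high classification accuracy on ImageNet after aggressive pruning.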

Abstract

Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning reduces inference cost, but most methods rely on retraining or multi-stage optimization, which limits post-training deployment. We propose CORP, a closed-form one-shot structured pruning method that removes MLP dimensions and attention substructures using only unlabeled calibration data without gradients or fine-tuning. CORP formulates structured pruning as a representation recovery problem. It models removed components as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes a layer-local affine/logit reconstruction objective under the calibration distribution. Experiments on ImageNet with DeiT reveal strong redundancy in both MLP and attention representations. With CORP, models retain high accuracy under aggressive sparsity. On DeiT-Huge, CORP achieves 83.27% Top-1 accuracy after pruning 50% of both MLP and attention structures.
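
The idea of modeling a removed component as an affine function of a retained one, then folding the compensation into the weights, can be sketched for the simplest possible case: one retained unit and one removed unit feeding a linear layer. This is a scalar illustration under that assumption, with an unpenalized intercept; the function names and calibration values are hypothetical, and the paper's multi-dimensional formulation is not reproduced here.

```python
def fit_affine_ridge(retained, removed, ridge=0.0):
    """Closed-form ridge fit of removed ~ slope * retained + intercept.

    Scalar case with an unpenalized intercept. Requires ridge > 0 or
    non-constant `retained` activations to avoid dividing by zero.
    """
    n = len(retained)
    mean_x = sum(retained) / n
    mean_y = sum(removed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(retained, removed))
    sxx = sum((x - mean_x) ** 2 for x in retained)
    slope = sxy / (sxx + ridge)
    intercept = mean_y - slope * mean_x
    return slope, intercept


def fold_compensation(w_retained, w_removed, bias, slope, intercept):
    """Fold the affine prediction of the removed unit into the next layer."""
    return w_retained + w_removed * slope, bias + w_removed * intercept


# Hypothetical calibration activations: the removed unit equals 2 * x + 1.
retained_acts = [0.0, 1.0, 2.0, 3.0]
removed_acts = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_affine_ridge(retained_acts, removed_acts)  # 2.0, 1.0

# Next layer before pruning: 0.5 * retained + 0.25 * removed + 0.1
new_w, new_b = fold_compensation(0.5, 0.25, 0.1, slope, intercept)
```

Because the toy removed unit is an exact affine function of the retained one, the pruned layer `new_w * retained + new_b` reproduces the original output; with real activations the fit is only approximate, and a positive `ridge` shrinks the slope.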

2602.05214 2026-05-12 cs.LG

Disentangled Representation Learning via Flow Matching

Jinjin Chi, Taoping Liu, Mengtao Yin, Ximing Li, Yongcheng Jing, Jialie Shen, Leszek Rutkowski, Dacheng Tao

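Affiliations College of Computer Science and Technology, Jilin University; College of Computing and Data Science, Nanyang Technological University; City St George’s, University of London; Systems Research Institute, Polish Academy of Sciences

AI summary This paper proposes a flow matching-based framework for disentangled representation learning that casts disentanglement as learning conditional flows in a compact latent space. To enforce explicit semantic alignment, the authors introduce a non-overlap regularizer that suppresses interference between factors and reduces information leakage. Experiments on multiple datasets show consistent improvements over existing baselines, with higher disentanglement scores and better controllability and sample fidelity.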

Abstract

Disentangled representation learning aims to capture the underlying explanatory factors of observed data, enabling a principled understanding of the data-generating process. Recent advances in generative modeling have introduced new paradigms for learning such representations. However, existing diffusion-based methods encourage factor independence via inductive biases, yet frequently lack strong semantic alignment. In this work, we propose a flow matching-based framework for disentangled representation learning, which casts disentanglement as learning factor-conditioned flows in a compact latent space. To enforce explicit semantic alignment, we introduce a non-overlap (orthogonality) regularizer that suppresses cross-factor interference and reduces information leakage between factors. Extensive experiments across multiple datasets demonstrate consistent improvements over representative baselines, yielding higher disentanglement scores as well as improved controllability and sample fidelity.
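
One common way to write a non-overlap (orthogonality) penalty is as the sum of squared dot products between every pair of factor vectors. The sketch below assumes that form; the abstract does not specify the regularizer's exact definition, so `overlap_penalty` and the toy vectors are hypothetical.

```python
def overlap_penalty(factors):
    """Sum of squared dot products over all distinct pairs of factor vectors.

    The penalty is zero exactly when every pair of vectors is orthogonal.
    """
    penalty = 0.0
    for i in range(len(factors)):
        for j in range(i + 1, len(factors)):
            d = sum(a * b for a, b in zip(factors[i], factors[j]))
            penalty += d * d
    return penalty


# Toy 2-dimensional factor vectors, purely illustrative.
disentangled = overlap_penalty([[1.0, 0.0], [0.0, 1.0]])  # 0.0
entangled = overlap_penalty([[1.0, 0.0], [1.0, 1.0]])  # 1.0
```

Adding such a term to the training loss pushes factor representations toward mutually orthogonal directions, which is one way to discourage information shared between factors.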

2602.03916 2026-05-12 cs.CV cs.CE cs.CL cs.LG

SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez

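Affiliations Computational Intelligence and Operations Laboratory; Shahjalal University of Science and Technology; BRAC University; North South University; Monash University; Qatar Computing Research Institute

AI summary SpatiaLab is a comprehensive benchmark for evaluating the spatial reasoning of vision-language models (VLMs) in real-world scenes. The study finds that existing models still fall well short on complex spatial relationships, depth perception, navigation, and 3D geometry. SpatiaLab contains 1,400 visual question-answer pairs covering six major categories and 30 task types, and experiments show that current state-of-the-art VLMs perform far below humans on spatial reasoning tasks.

Comments Accepted to ICLR 2026 (https://openreview.net/forum?id=fWWUPOb0CT). 92 Pages. 42 Figures and 29 Tables

Journal ref ICLR 2026

Details
English abstract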

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.

2602.03783 2026-05-12 cs.LG cs.AI cs.CL

Efficient Estimation of Kernel Surrogate Models for Task Attribution

Zhenshuo Zhang, Minxuan Duan, Hongyang R. Zhang

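Affiliations Northeastern University

AI summary This paper studies how to quantify the influence of different training tasks on target-task performance, i.e., the task attribution problem. Traditional approaches such as leave-one-out retraining are computationally expensive, while existing linear surrogate models cannot capture nonlinear task interactions. The authors therefore propose kernel-based surrogate models that more effectively represent second-order task interactions, and design an efficient gradient-based estimation method that yields accurate surrogates without repeated retraining. Experiments show that kernel surrogates significantly outperform linear models and influence-function methods across multiple task settings, improving the accuracy and scalability of task attribution.

Comments 27 pages. Appeared in ICLR 2026

Details
English abstract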

Modern AI agents such as large language models are trained on diverse tasks -- translation, code generation, mathematical reasoning, and text prediction -- simultaneously. A key question is how to quantify the influence of each individual training task on performance on a target task, a problem we refer to as task attribution. The direct approach, leave-one-out retraining, measures the effect of removing each task, but is computationally infeasible at scale. An alternative approach that builds surrogate models to predict the performance on a target task for any subset of training tasks has emerged in the recent literature. Prior work focuses on linear surrogate models, which capture first-order relationships but miss nonlinear interactions such as XOR-type effects. In this paper, we first consider a unified task-weighting framework for analyzing task-attribution methods and establish a new connection between linear surrogate models and influence functions via a second-order analysis. Then, we introduce kernel surrogate models, which more effectively represent second-order task interactions. To efficiently learn the kernel surrogate, we develop a gradient-based estimation procedure that leverages a first-order approximation of pretrained models; empirically, this yields accurate surrogate estimates with less than 2% relative error without repeated retraining. Experiments across multiple settings -- including mathematical reasoning in transformers, in-context learning, and multi-objective reinforcement learning -- demonstrate the effectiveness of kernel surrogate models. They achieve a 25% higher correlation with the leave-one-out ground truth than linear surrogates and influence-function baselines, enabling more accurate and scalable task attribution. When used for downstream data selection, kernel surrogate models further yield a 40% improvement in the aforementioned settings.
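
The XOR-type limitation of linear surrogates can be reproduced on a two-task toy problem. The sketch below fits kernel ridge regression over task-subset indicator vectors, once with a linear kernel and once with a degree-2 polynomial kernel. The kernel choice, the ridge value, the scores, and all function names are illustrative assumptions; this is not the paper's gradient-based estimator.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def linear_kernel(s, t):
    """Inner product plus 1, so the surrogate includes an intercept."""
    return 1.0 + sum(a * b for a, b in zip(s, t))


def quadratic_kernel(s, t):
    """Degree-2 polynomial kernel; its features include pairwise products."""
    return linear_kernel(s, t) ** 2


def fit_predict(kernel, subsets, scores, queries, ridge=1e-6):
    """Kernel ridge regression: alpha = (K + ridge*I)^-1 y, f(q) = sum_i alpha_i k(s_i, q)."""
    n = len(subsets)
    gram = [[kernel(subsets[i], subsets[j]) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, scores)
    return [sum(a * kernel(s, q) for a, s in zip(alpha, subsets)) for q in queries]


# Each tuple marks which of two training tasks is included; scores follow XOR.
subsets = [(0, 0), (1, 0), (0, 1), (1, 1)]
scores = [0.0, 1.0, 1.0, 0.0]
linear_fit = fit_predict(linear_kernel, subsets, scores, subsets)
quadratic_fit = fit_predict(quadratic_kernel, subsets, scores, subsets)
```

On this toy data the linear surrogate can do no better than predicting roughly 0.5 for every subset, while the quadratic kernel recovers the XOR pattern because its feature space contains the product of the two task indicators.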

2602.01687 2026-05-12 cs.CL cs.AI

Functional Subspace, where language models can use vector algebra to solve problems

Jung H. Lee, Sujith Vijayan

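Affiliations Pacific Northwest National Laboratory; School of Neuroscience

AI summary This study investigates whether large language models (LLMs) use subspaces and vector algebra when performing tasks. By analyzing the models' functional modules and residual streams during in-context learning, it finds that LLMs can create subspaces in which evidence is accumulated and can solve tasks through simple algebraic operations within those subspaces. The finding offers a new perspective on how LLMs work and what they may be capable of.

Comments 20 pages, 7 main figures, 8 supplementary figures

Details
English abstract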

Large language models (LLMs) were invented for natural language tasks such as translation, but they have proved able to perform highly complex functions across domains. They are also thought to develop new skills without being trained on them. These learning capabilities have led to the adoption of LLMs in a wide range of domains. It is therefore imperative that we understand their operating mechanisms and limitations for proper diagnostics and repair. Earlier studies proposed that high-level concepts are encoded as linear directions in LLMs' activation space and that the geometry of embeddings has semantic meaning. Inspired by these studies, we hypothesize that LLMs may use subspaces, and vector algebra within those subspaces, to perform tasks. To test this hypothesis, we analyze LLMs' functional modules and residual streams collected from LLMs engaging in in-context learning (ICL), one of their emergent abilities. Our analyses suggest that 1) LLMs can create subspaces in which evidence can be accumulated and 2) ICL tasks can be solved via simple algebraic operations in those subspaces.

2602.00986 2026-05-12 cs.CL

Sparse Reward Subsystem in Large Language Models

Guowei Xu, Mert Yuksekgonul, James Zou

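Affiliations Tsinghua University; Stanford University

AI summary Recent studies show that LLM hidden states encode reward-related information such as answer correctness and model confidence. This paper finds that this information is concentrated in a small subset of neurons in the hidden states, and uses simple probes to identify two types of neurons: value neurons, which encode state value, and dopamine neurons, which encode temporal-difference errors. The finding reveals a sparse reward subsystem in LLMs, offers a new perspective on their internal reward mechanisms, and demonstrates applications to model-confidence prediction and guiding inference-time search.

Details
English abstract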

Recent studies show that LLM hidden states encode reward-related information, such as answer correctness and model confidence. However, existing approaches typically fit black-box probes on the full hidden states, offering little insight into how this information is structured across neurons. In this paper, we show that reward-related information is concentrated in a sparse subset of neurons. Using simple probing, we identify two types of neurons: value neurons, whose activations predict state value, and dopamine neurons, whose activations encode step-level temporal difference (TD) errors. Together, these neurons form a sparse reward subsystem within LLM hidden states. These names are drawn by analogy with neuroscience, where value neurons and dopamine neurons in the biological reward subsystem also encode value and reward prediction errors, respectively. We demonstrate that value neurons are robust and transferable across diverse datasets and models, and provide causal evidence that they encode reward-related information. Finally, we show applications of the reward subsystem: value neurons serve as effective predictors of model confidence, and dopamine neurons can function as a process reward model (PRM) to guide inference-time search.
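
For readers unfamiliar with the quantity that dopamine neurons are said to encode, the step-level TD error is the standard delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). The sketch below computes it for a made-up three-step reasoning trace. The values, the rewards, and the convention that the value after the final step is zero are illustrative assumptions, not data from the paper.

```python
def td_errors(values, rewards, gamma=1.0):
    """Return delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) for each step.

    The value after the final step is taken to be 0 (episode ends).
    """
    deltas = []
    for t, (value, reward) in enumerate(zip(values, rewards)):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        deltas.append(reward + gamma * next_value - value)
    return deltas


# Hypothetical trace: estimated value rises as the reasoning progresses,
# and a reward of 1 arrives only when the final answer is correct.
values = [0.2, 0.5, 0.9]
rewards = [0.0, 0.0, 1.0]
deltas = td_errors(values, rewards)
```

A positive delta marks a step that made success look more likely than the previous estimate suggested, which is the kind of per-step signal a process reward model needs.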

2602.00953 2026-05-12 cs.LG

SAGE: Agentic Framework for Interpretable and Clinically Translatable Computational Pathology Biomarker Discovery

Sahar Almahfouz Nasser, Juan Francisco Pesantez Borja, Jincheng Liu, Sandeep Manandhar, Shikhar Shiromani, Mohammad Tanvir Hasan, Zenghan Wang, Suman Ghosh, Jinchu Li, Xuejian Xu, Aniket Ramkrishnan Iyer, Naoto Tokuyama, Twisha Shah, Tilak Pathak, Soundharya Kumaresan, Yohei Abe, Himanshu Maurya, Anant Madabhushi

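Affiliations Emory University; Georgia Institute of Technology; NVIDIA; University of Arkansas at Little Rock

AI summary SAGE is an agentic framework for discovering interpretable, clinically translatable computational pathology biomarkers. It grounds biomarker discovery in solid biological evidence through knowledge-graph-guided hypothesis generation, debate-based multi-agent novelty assessment, and an end-to-end automated validation pipeline. Its core contribution is turning a discovery process that relied on intuition and literature browsing into a structured, traceable reasoning process, improving clinical trustworthiness and applicability.

Details
English abstract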

Engineered image-based biomarkers offer a clinically interpretable alternative to black-box AI in computational pathology, yet their discovery remains largely intuition-driven, guided by fragmented literature rather than rigorous biological validation. We introduce SAGE (Structured Agentic system for hypothesis Generation and Evaluation), a multi-agent framework that grounds biomarker discovery in biological evidence through three mechanisms: (i) knowledge-graph-anchored hypothesis generation via multi-path ontological reasoning, (ii) a debate-based multi-agent novelty assessment that stress-tests candidate biomarkers against existing literature, and (iii) an end-to-end automated validation pipeline that translates hypotheses directly into executable analyses on multimodal pathology datasets. Together, these components shift biomarker discovery from an intuition-driven, literature-browsing exercise into a structured, traceable reasoning process that clinicians and researchers can inspect, trust, and build upon.

2602.00877 2026-05-12 cs.RO

Learning When to Jump for Off-road Navigation

Zhipeng Zhao, Taimeng Fu, Shaoshu Su, Qiwei Du, Ehsan Tarkesh Esfahani, Karthik Dantu, Souma Chowdhury, Chen Wang

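Affiliations University at Buffalo

AI summary This paper studies how to control speed in off-road navigation so that a vehicle can jump safely over obstacles. To address existing methods' neglect of motion dynamics, the authors propose the Motion-aware Traversability (MAT) representation, which models terrain cost as a Gaussian function of velocity. The method predicts terrain parameters from perception and efficiently updates terrain costs for different velocities during planning, enabling agile off-road navigation. Experiments show that MAT significantly improves navigation performance while maintaining safety, reducing path detours by 75%.

Details
English abstract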

Low speed does not always guarantee safety in off-road driving. For instance, crossing a ditch may be risky at a low speed due to the risk of getting stuck, yet safe at a higher speed with a controlled, accelerated jump. Achieving such behavior requires path planning that explicitly models complex motion dynamics, whereas existing methods often neglect this aspect and plan solely based on positions or a fixed velocity. To address this gap, we introduce the Motion-aware Traversability (MAT) representation to explicitly model terrain cost conditioned on actual robot motion. Instead of assigning a single scalar score for traversability, MAT models each terrain region as a Gaussian function of velocity. During online planning, we decompose the terrain cost computation into two stages: (1) predict terrain-dependent Gaussian parameters from perception in a single forward pass, and (2) efficiently update terrain costs for new velocities inferred from current dynamics by evaluating these functions without repeated inference. We develop a system that integrates MAT to enable agile off-road navigation and evaluate it in both simulated and real-world environments with various obstacles. Results show that MAT achieves real-time efficiency and enhances the performance of off-road navigation, reducing path detours by 75% while maintaining safety across challenging terrains.
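
The two-stage cost computation can be sketched as follows. The abstract does not say how the Gaussian is parameterized, so this example assumes the cost peaks at a "risky" velocity and decays away from it. `TerrainGaussian`, `predict_terrain_parameters`, and all numbers are hypothetical, and a lookup table stands in for the perception network.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TerrainGaussian:
    peak_cost: float        # cost at the most dangerous velocity
    risky_velocity: float   # velocity (m/s) at which the cost peaks
    spread: float           # how quickly the cost falls off with velocity

    def cost(self, velocity: float) -> float:
        """Evaluate the Gaussian cost at a given velocity."""
        z = (velocity - self.risky_velocity) / self.spread
        return self.peak_cost * math.exp(-0.5 * z * z)


def predict_terrain_parameters(terrain_label: str) -> TerrainGaussian:
    """Stand-in for the single perception forward pass (stage 1)."""
    table = {
        "ditch": TerrainGaussian(peak_cost=1.0, risky_velocity=0.5, spread=1.0),
        "gravel": TerrainGaussian(peak_cost=0.6, risky_velocity=5.0, spread=1.5),
    }
    return table[terrain_label]


# Stage 1: predict parameters once. Stage 2: re-evaluate cheaply per velocity.
ditch = predict_terrain_parameters("ditch")
costs = {v: ditch.cost(v) for v in (0.5, 2.0, 4.0)}
```

With these made-up parameters the ditch is most costly near 0.5 m/s and becomes cheaper as speed increases, which mirrors the abstract's ditch example. Stage 2 only evaluates a closed-form function, so a planner could query many candidate velocities without rerunning perception.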

2602.00678 2026-05-12 cs.RO

Toward Reliable Sim-to-Real Predictability for MoE-based Robust Quadrupedal Locomotion

Tianyang Wu, Hanwei Guo, Yuhang Wang, Junshu Yang, Xinyang Sui, Jiayi Xie, Xingyu Chen, Zeyang Liu, Xuguang Lan

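Affiliations National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University

AI summary This study addresses reliable sim-to-real transfer for Mixture-of-Experts (MoE) quadruped robots on complex terrain. It proposes a unified framework that combines an MoE locomotion policy for robust multi-terrain representation with RoboGauge, a predictive assessment suite for evaluating sim-to-real transferability. Relying on proprioception alone, the method achieves stable and efficient locomotion on a variety of unseen complex terrains and reaches 4 m/s in high-speed tests, demonstrating strong performance and generalization.

Comments Accepted at Robotics: Science and Systems (RSS), 2026. Project Page: https://robogauge.github.io/complete/

Details
English abstract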

Reinforcement learning has shown strong promise for quadrupedal agile locomotion, even with proprioception-only sensing. In practice, however, the sim-to-real gap and reward overfitting in complex terrains can produce policies that fail to transfer, while physical validation remains risky and inefficient. To address these challenges, we introduce a unified framework that pairs a Mixture-of-Experts (MoE) locomotion policy for robust multi-terrain representation with RoboGauge, a predictive assessment suite that quantifies sim-to-real transferability. The MoE policy employs a gated set of specialist experts to decompose latent terrain and command modeling, achieving superior deployment robustness and generalization via proprioception alone. RoboGauge further provides multi-dimensional proprioception-based metrics via sim-to-sim tests over terrains, difficulty levels, and domain randomizations, enabling reliable MoE policy selection without extensive physical trials. Experiments on a Unitree Go2 demonstrate robust locomotion on unseen challenging terrains, including snow, sand, stairs, slopes, and 30 cm obstacles. In dedicated high-speed tests, the robot reaches 4 m/s and exhibits an emergent narrow-width gait associated with improved stability at high velocity.
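
A gated mixture of experts can be sketched in a few lines. The abstract does not specify the gating function, so the example below assumes dense softmax gating that blends every expert's action. `moe_action` and the numbers are hypothetical, and in a real policy both the gate logits and the expert actions would come from networks fed with proprioceptive observations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_action(gate_logits, expert_actions):
    """Blend expert action vectors with softmax gate weights."""
    weights = softmax(gate_logits)
    dim = len(expert_actions[0])
    action = [sum(w * a[d] for w, a in zip(weights, expert_actions))
              for d in range(dim)]
    return action, weights


# Two hypothetical experts; equal logits give an even blend.
action, weights = moe_action([0.0, 0.0], [[1.0, 0.0], [3.0, 2.0]])
```

Raising one gate logit shifts the blended action toward that expert, which is how a gate can specialize experts to different terrains or commands.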

2602.00318 2026-05-12 cs.LG cs.AI cs.CR

Optimal Transport-Guided Adversarial Attacks on Graph Neural Network-Based Bot Detection

Kunal Mukherjee, Zulfikar Alom, Tran Gia Bao Ngo, Cuneyt Gurcan Akcora, Murat Kantarcioglu

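Affiliations Department of Computer Science, Virginia Tech, Virginia, USA; Department of Computer Science, Manitoba, Canada; Department of Computer Science, University of Toledo, Ohio, USA; AI Initiative, University of Central Florida, Florida, USA

AI summary With the growth of bot accounts on social media, graph neural network (GNN)-based bot detection has attracted increasing attention. However, existing attack methods have limited effectiveness under the domain-specific and temporal constraints of real-world settings. This paper proposes BOCLOAK, which uses optimal transport to systematically evaluate GNN robustness under edge-editing and node-injection attacks. It achieves efficient attacks while satisfying realistic constraints, substantially raising attack success rates while greatly reducing compute resource consumption, and offers a lightweight, principled framework for bridging adversarial attacks and real-world bot detection.

Comments Accepted to Proceedings of the Forty-Third International Conference on Machine Learning (ICML) 2026

Journal ref Proceedings of the Forty-Third International Conference on Machine Learning 2026

Details
English abstract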

The rise of bot accounts on social media poses significant risks to public discourse. To address this threat, modern bot detectors increasingly rely on Graph Neural Networks (GNNs). However, the effectiveness of these GNN-based detectors in real-world settings remains poorly understood. In practice, attackers continuously adapt their strategies and must operate under domain-specific and temporal constraints, which can fundamentally limit the applicability of existing attack methods. As a result, there is a critical need for robust GNN-based bot detection methods under realistic, constraint-aware attack scenarios. To address this gap, we introduce BOCLOAK to systematically evaluate the robustness of GNN-based social bot detection via both edge editing and node injection adversarial attacks under realistic constraints. BOCLOAK constructs a probability measure over spatio-temporal neighbor features and learns an optimal transport geometry that separates human and bot behaviors. It then decodes transport plans into sparse, plausible edge edits that evade detection while obeying real-world constraints. We evaluate BOCLOAK across three social bot datasets, five state-of-the-art bot detectors, and three adversarial defenses, and compare it against four leading graph adversarial attack baselines. BOCLOAK achieves up to 80.13% higher attack success rates while using 99.80% less GPU memory under realistic constraints. Most importantly, BOCLOAK shows that optimal transport provides a lightweight, principled framework for bridging the gap between adversarial attacks and real-world bot detection.

2601.23273 2026-05-12 cs.CL

UPA: Unsupervised Prompt Agent via Tree-Based Search and Selection

Siran Peng, Weisong Zhao, Tianyu Fu, Chenxu Zhao, Tianshuo Zhang, Haoyuan Zhang, Xiangyu Zhu, Minghui Wu, Zhen Lei

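Affiliations MAIS, CASIA; SAI, UCAS; Mininglamp Technology; IIE, CAS; SCSE, FIE, M.U.S.T

AI summary This paper proposes UPA, a prompt optimization method that requires no supervised reward signal and explores a structured prompt space through tree-based search and selection. UPA uses large language models for fine-grained, position-debiased pairwise comparisons, combined with a two-stage framework based on the Bradley-Terry-Luce model that performs Bayesian aggregation of local comparisons followed by global tournament-style comparisons, to identify the best prompt in an unsupervised setting. Experiments show that UPA outperforms existing methods across multiple tasks, confirming its effectiveness in unsupervised scenarios.

Details
English abstract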

Prompt agents have recently emerged as a promising paradigm for automated prompt optimization, framing prompt discovery as a sequential decision-making problem over a structured prompt space. While this formulation enables the use of advanced planning algorithms, these methods typically assume access to supervised reward signals, which are often unavailable in practical scenarios. In this work, we propose UPA, an Unsupervised Prompt Agent that realizes structured search and selection without relying on ground-truth (GT) rewards. Specifically, during search, UPA iteratively constructs an evolving tree structure to navigate the prompt space, guided by fine-grained and position-debiased pairwise comparisons from Large Language Models (LLMs). Crucially, as these local comparisons do not inherently yield a consistent global scale, we decouple systematic prompt exploration from final selection, introducing a two-stage framework grounded in the Bradley-Terry-Luce (BTL) model. This framework first performs path-wise Bayesian aggregation of local comparisons to filter candidates under uncertainty, followed by global tournament-style comparisons to infer latent prompt quality and identify the optimal prompt. Experiments across multiple tasks demonstrate that UPA consistently outperforms existing prompt optimization methods, showing that agent-style optimization can remain highly effective even in unsupervised settings.
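
To make the Bradley-Terry-Luce step concrete: under BTL, item i beats item j with probability p_i / (p_i + p_j), and the latent strengths can be estimated from pairwise win counts. The sketch below uses the classical minorization-maximization update for the maximum-likelihood strengths. It is a generic BTL fit on made-up tournament counts, and it does not model UPA's path-wise Bayesian aggregation or position debiasing; `fit_btl` is a hypothetical name.

```python
def btl_win_probability(strength_i, strength_j):
    """Probability that item i beats item j under the BTL model."""
    return strength_i / (strength_i + strength_j)


def fit_btl(wins, n_items, iterations=200):
    """Estimate BTL strengths; wins[(i, j)] = times item i beat item j."""
    strengths = [1.0] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            total_wins = sum(c for (winner, _), c in wins.items() if winner == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += games / (strengths[i] + strengths[j])
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        scale = n_items / sum(updated)      # fix the arbitrary overall scale
        strengths = [s * scale for s in updated]
    return strengths


# Hypothetical tournament between three candidate prompts.
wins = {(0, 1): 8, (1, 0): 2, (0, 2): 9, (2, 0): 1, (1, 2): 7, (2, 1): 3}
strengths = fit_btl(wins, n_items=3)
best_prompt = max(range(3), key=lambda i: strengths[i])
```

The fitted strengths put all candidates on one global scale even though each comparison was local, which is the property that motivates a BTL-based selection stage.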

2601.22320 2026-05-12 cs.LG stat.ML

Matrix Factorization for Practical Continual Mean Estimation Under User-Level Differential Privacy

Nikita P. Kalinin, Ali Najar, Valentin Roth, Christoph H. Lampert

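Affiliations Institute of Science and Technology Austria (ISTA)

AI summary This paper studies continual mean estimation under user-level differential privacy, i.e., how to maintain accurate estimates of the running mean as data vectors arrive sequentially. To address this, the authors adopt the approximate differential privacy framework together with the matrix factorization mechanism and propose a factorization tailored to mean estimation which, while preserving privacy, significantly lowers the mean-squared error of the estimates and improves accuracy and efficiency in practice.

Details
English abstract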

We study continual mean estimation, where data vectors arrive sequentially and the goal is to maintain accurate estimates of the running mean. We address this problem under user-level differential privacy, which protects each user's entire dataset even when they contribute multiple data points. Previous work on this problem has focused on pure differential privacy. While important, this approach limits applicability, as it leads to overly noisy estimates. In contrast, we analyze the problem under approximate differential privacy, adopting recent advances in the Matrix Factorization mechanism. We introduce a novel mean-estimation-specific factorization, which is both efficient and accurate, achieving asymptotically lower mean-squared error bounds in continual mean estimation under user-level differential privacy.
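
As background, the matrix factorization mechanism writes the workload matrix A (here, running means) as A = BC and releases B(Cx + z) = Ax + Bz with Gaussian noise z calibrated to the sensitivity of C. The sketch below compares the two trivial factorizations on a length-4 stream. It assumes each user contributes a single unit-bounded data point, which is simpler than the user-level setting of the paper, and it does not implement the paper's factorization. All function names are hypothetical.

```python
def running_mean_workload(n):
    """Row t averages the first t+1 inputs: A[t][i] = 1/(t+1) for i <= t."""
    return [[1.0 / (t + 1) if i <= t else 0.0 for i in range(n)]
            for t in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def expected_squared_error(b, c, sigma=1.0):
    """Total E||B z||^2 for z ~ N(0, (sigma * sens(C))^2 I).

    sens(C) is the largest column L2 norm of C (one contribution per user).
    """
    frobenius_sq = sum(x * x for row in b for x in row)
    n_cols = len(c[0])
    sensitivity_sq = max(sum(row[j] ** 2 for row in c) for j in range(n_cols))
    return sigma ** 2 * sensitivity_sq * frobenius_sq


n = 4
workload = running_mean_workload(n)
# A = A @ I: add noise to the inputs, then average.
input_perturbation = expected_squared_error(workload, identity(n))
# A = I @ A: average first, then add noise to each output.
output_perturbation = expected_squared_error(identity(n), workload)
```

Both factorizations give the same privacy guarantee for a fixed `sigma` but different total error, so the choice of B and C is exactly where a mean-specific factorization can gain accuracy.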

2601.21699 2026-05-12 cs.CL

Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang

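Affiliations ETRI; The University of Hong Kong; Seoul National University

AI summary This paper studies how to improve the training efficiency and effectiveness of multi-hop reasoning agents under resource constraints. The authors propose David-GRPO, which introduces expert-guided and evidence-guided exploration strategies to make effective use of small batches in reinforcement learning, increasing reasoning depth and evidence coverage. Experiments show that, under the same low budget, the method outperforms existing RL baselines on multiple multi-hop QA benchmarks.

Comments Preprint

Details
English abstract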

Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, accumulating supporting evidence across turns. Meanwhile, training these agents with reinforcement learning (RL) relies on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only a few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and inside the current policy: (i) expert bootstrapping injects a few off-policy expert trajectories into RL updates, and (ii) evidence-guided exploration turns on-policy partial successes into evidence-coverage scores and additional continuations. On agents up to 1.5B parameters trained on four RTX 3090 GPUs, David-GRPO improves over prior RL baselines under the same low-budget setting on six multi-hop QA benchmarks. The gains come with a behavioral shift: unlike prior low-budget RL baselines that often skip retrieval or stop after shallow search, David-GRPO learns to increase retrieval depth and evidence coverage.
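
One plausible reading of an evidence-coverage score is the fraction of gold supporting documents that a rollout retrieved, which gives partial credit even when the final answer is wrong. The sketch below pairs that score with the group-normalized advantage used in GRPO-style training. The scoring rule, the use of coverage as the only reward, the document IDs, and the function names are all assumptions for illustration, since the abstract does not give these details.

```python
import statistics


def evidence_coverage(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents found by a rollout."""
    gold = set(gold_ids)
    return len(gold & set(retrieved_ids)) / len(gold)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one three-hop question.
gold = ["d1", "d2", "d3"]
rollouts = [["d1", "d9"], ["d1", "d2", "d3"], ["d7"], ["d2", "d3"]]
coverage = [evidence_coverage(r, gold) for r in rollouts]
advantages = group_relative_advantages(coverage)
```

Because coverage is graded rather than binary, a small batch with no fully correct rollout still yields non-zero advantages, which is the kind of signal that partial successes can contribute.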

2601.18823 2026-05-12 cs.LG

VAE with Hyperspherical Coordinates: Improving Anomaly Detection from Hypervolume-Compressed Latent Space

Alejandro Ascarate, Leo Lebrat, Rodrigo Santa Cruz, Clinton Fookes, Olivier Salvado

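Affiliations School of Electrical Engineering and Robotics; Queensland University of Technology; Data61 CSIRO

AI summary This paper studies how introducing hyperspherical coordinates can improve the performance of variational autoencoders (VAEs) in anomaly detection. Standard VAEs struggle to detect anomalies in high-dimensional latent spaces because latent vectors tend to concentrate on the 'equator' regions of a hypersphere. The authors propose expressing the latent variables in hyperspherical coordinates, which compresses the latent-space volume and makes the posterior more expressive, ultimately yielding significant improvements in unconditional and conditional anomaly detection on several real-world and standard datasets.

Details
English abstract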

Variational autoencoders (VAEs) encode data into lower-dimensional latent vectors before decoding those vectors back to data. Once a VAE is trained, one can hope to detect out-of-distribution (abnormal) latent vectors, but several issues arise when the latent space is high dimensional. These include exponential growth of the hypervolume with the dimension, which severely affects the generative capacity of the VAE. In this paper, we draw insights from high-dimensional statistics: in these regimes, the latent vectors of a standard VAE are distributed on the 'equators' of a hypersphere, complicating the detection of anomalies. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows compressing the latent vectors towards a given direction on the hypersphere, thereby allowing for a more expressive approximate posterior. We show that this improves both the fully unconditional-OOD and conditional-OOD anomaly detection ability of the VAE, achieving the best performance on the datasets we considered and outperforming existing methods. For the unconditional-OOD and conditional-OOD modalities, respectively, these are: i) detecting unusual landscapes from the Mars Rover camera and unusual galaxies from ground-based imagery (complex, real-world datasets); ii) standard benchmarks such as CIFAR-10 and subsets of ImageNet as the in-distribution (ID) class.
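
For reference, the standard change of variables between Cartesian and hyperspherical coordinates is sketched below for vectors of dimension at least 2. This is only the textbook coordinate transform: a radius plus n-1 angles, the first n-2 in [0, pi] and the last in (-pi, pi]. How the paper parameterizes the VAE posterior over these angles is not stated in the abstract, and the function names are hypothetical.

```python
import math


def to_hyperspherical(x):
    """Convert a Cartesian vector (len >= 2) to (radius, angles)."""
    n = len(x)
    radius = math.sqrt(sum(v * v for v in x))
    angles = []
    for k in range(n - 2):
        tail = math.sqrt(sum(v * v for v in x[k + 1:]))
        angles.append(math.atan2(tail, x[k]))       # polar-type angle in [0, pi]
    angles.append(math.atan2(x[-1], x[-2]))          # last angle in (-pi, pi]
    return radius, angles


def to_cartesian(radius, angles):
    """Inverse transform: x_k = r * sin(phi_1)...sin(phi_{k-1}) * cos(phi_k)."""
    coords = []
    sin_product = 1.0
    for phi in angles:
        coords.append(radius * sin_product * math.cos(phi))
        sin_product *= math.sin(phi)
    coords.append(radius * sin_product)
    return coords


latent = [1.0, -2.0, 0.5, 3.0]
radius, angles = to_hyperspherical(latent)
reconstructed = to_cartesian(radius, angles)
```

In these coordinates an angle of pi/2 corresponds to the 'equator' with respect to that axis, so shifting the angles away from pi/2 is one way to picture compressing latent vectors toward a chosen direction.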

2601.15065 2026-05-12 cs.CV

Enhancing Few-Shot Out-of-Distribution Detection via the Refinement of Foreground and Background

Tianyu Li, Zongqian Wu, Songyue Cai, Ping Hu, Xiaofeng Zhu

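Affiliations School of Computer Science and Engineering, University of Electronic Science and Technology of China; School of Computer Science and Technology, Hainan University

AI summary This paper addresses shortcomings of foreground-background decomposition methods in few-shot out-of-distribution (OOD) detection and proposes a new plug-and-play framework. Through two core modules, Adaptive Background Suppression and Confusable Foreground Rectification, it respectively optimizes the classification-entropy weights of background regions and rectifies foreground regions that resemble other classes, improving detection performance. Experiments show that the framework effectively improves the OOD detection ability of existing methods in few-shot settings.

Comments arXiv preprint arXiv:2601.15065 (2026)

Details
English abstract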

CLIP-based foreground-background (FG-BG) decomposition methods have demonstrated remarkable effectiveness in improving few-shot out-of-distribution (OOD) detection performance. However, existing approaches still suffer from several limitations. For background regions obtained from decomposition, existing methods adopt a uniform suppression strategy for all patches, overlooking the varying contributions of different patches to the prediction. For foreground regions, existing methods fail to adequately consider that some local patches may exhibit appearance or semantic similarity to other classes, which may mislead the training process. To address these issues, we propose a new plug-and-play framework. This framework consists of three core components: (1) a Foreground-Background Decomposition module, which follows previous FG-BG methods to separate an image into foreground and background regions; (2) an Adaptive Background Suppression module, which adaptively weights patch classification entropy; and (3) a Confusable Foreground Rectification module, which identifies and rectifies confusable foreground patches. Extensive experimental results demonstrate that the proposed plug-and-play framework significantly improves the performance of existing FG-BG decomposition methods. Code is available at: https://github.com/lounwb/FoBoR.