arXivDaily: arXiv daily academic digest, updated Monday to Friday
2508.04660 2026-05-12 cs.CL

Composing Policy Gradients and Prompt Optimization for Language Model Programs

Noah Ziems, Dilara Soylu, Lakshya A Agrawal, Isaac Miller, Liheng Lai, Chen Qian, Kaiqiang Song, Meng Jiang, Dan Klein, Matei Zaharia, Karel D'Oosterlinck, Christopher Potts, Omar Khattab

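Affiliations University of Notre Dame; Stanford University; UC Berkeley; Anyscale; CMU; Zoom, Inc.; Contextual AI; MIT

AI summary This paper studies how to apply Group Relative Policy Optimization (GRPO) to modular programs composed of multiple language model calls in order to improve their performance. The authors propose a multi-module GRPO method that performs policy-gradient optimization with module-level or trajectory-level grouping, and find that it combines effectively with automatic prompt optimization, markedly improving performance on classification, many-hop search, and privacy-preserving tasks. Experiments show an average accuracy gain of 11% across tasks, outperforming prompt optimization alone.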

Comments ACM CAIS 2026. Lakshya*, Dilara*, and Noah* contributed equally to this work

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how practitioners can best leverage online RL algorithms like GRPO to improve these systems. We begin to address this challenge by investigating whether it is possible to effectively instantiate GRPO for arbitrary multi-prompt programs and whether it can work robustly as an off-the-shelf optimizer for LM programs using the same abstractions and constraints typically involved for prompt optimization. Our main variant of multi-module GRPO constructs groups from module-level invocations, and we also consider trajectory-level grouping as another natural instantiation. We find for the first time that GRPO (and its multi-module counterpart) empirically composes well with automatic prompt optimization, and together they improve accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM - with 5% gains against prompt optimization on its own. We open-source multi-module GRPO in the DSPy library at https://dspy.ai .
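
As a rough illustration of the module-level grouping mentioned in this abstract, the sketch below standardizes rewards within a group (the usual GRPO advantage) and forms one group per module per input example. The data layout (`example_id`, `calls`, `reward`), the function names, and the choice to give every module invocation its trajectory's final reward are assumptions made for illustration; this is not the DSPy implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def module_level_groups(rollouts):
    """Build one group per (example, module) and return its advantages.

    Assumed (hypothetical) rollout layout:
      {"example_id": int, "reward": float, "calls": [module names invoked]}
    Every invocation inherits the final reward of its trajectory.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for module_name in rollout["calls"]:
            key = (rollout["example_id"], module_name)
            groups[key].append(rollout["reward"])
    return {key: group_relative_advantages(rs) for key, rs in groups.items()}
```

Trajectory-level grouping, the other variant named in the abstract, would instead key the groups on `example_id` alone, so that whole rollouts of the same input are compared with each other.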

2507.23009 2026-05-12 cs.LG cs.AI

Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Tom Sühr, Florian E. Dorner, Olawale Salaudeen, Augustin Kelava, Samira Samadi

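Affiliations Max Planck Institute for Intelligent Systems, Tübingen; ETH Zurich, Zurich; University of Tübingen, Methods Center, Tübingen; Massachusetts Institute of Technology, Cambridge

AI summary This paper argues that evaluating large language models (LLMs) with cognitive and psychological tests designed for humans is fundamentally mistaken: such tests are theory-driven instruments designed for specific human populations, and applying them directly to AI can be misleading. It contends that interpreting AI benchmark performance as a measurement of human traits such as "intelligence" lacks sufficient theoretical and empirical grounding, and calls for abandoning human tests in favor of principled evaluation frameworks tailored to AI, so that the capabilities of AI systems are measured more accurately.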

Abstract

Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific human population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as measurements of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as measurements of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific evaluation frameworks tailored to AI systems. Such frameworks might build on existing frameworks for constructing and validating psychometric tests, or could be created entirely from scratch to fit the unique context of AI.

2507.18847 2026-05-12 cs.RO cs.AI

Equivariant Volumetric Grasping

Pinhao Song, Yutong Hu, Pengteng Li, Renaud Detry

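Affiliations KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics; KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images; AI Thrust, HKUST(GZ)

AI summary This paper proposes a volumetric grasp model that is equivariant to rotations about the vertical axis, significantly improving sampling efficiency. The model uses a tri-plane volumetric feature representation with a new tri-plane feature design in which features on the horizontal plane are equivariant to 90-degree rotations, while the sum of the features on the other two planes is invariant to reflections. Building on this, the authors develop equivariant versions of two state-of-the-art volumetric grasp planners and validate the approach in extensive simulated and real-world experiments, showing reduced computational and memory cost and better performance than non-equivariant models under real-time constraints.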

Comments 21 pages

Abstract

We propose a new volumetric grasp model that is equivariant to rotations around the vertical axis, leading to a significant improvement in sampling efficiency. Our model employs a tri-plane volumetric feature representation, i.e., the projection of 3D features onto three canonical planes. We introduce a novel tri-plane feature design in which features on the horizontal plane are equivariant to $90^\circ$ rotations, while the sum of features from the other two planes remains invariant to reflections induced by the same transformations. We further develop equivariant adaptations of two state-of-the-art volumetric grasp planners, GIGA and IGD. Specifically, we derive a new equivariant formulation of IGD's deformable attention mechanism and propose an equivariant generative model of grasp orientations based on flow matching. We provide a detailed analytical justification of the proposed equivariance properties and validate our approach through extensive simulated and real-world experiments. Our results demonstrate that the proposed projection-based design reduces both computational and memory costs. Moreover, the equivariant grasp models built on top of our tri-plane features consistently outperform their non-equivariant counterparts, achieving higher performance within a real-time cost constraint. Video and code are available at: https://mousecpn.github.io/evg-page/

2507.15518 2026-05-12 cs.AI cs.MA

HAMLET: A Hierarchical and Adaptive Multi-Agent Framework for Live Embodied Theatrics

Shufan Jiang, Sizhou Chen, Chios Chen, Chi Zhang, Xiao-Lei Zhang, Xuelong Li

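Affiliations East China University of Science and Technology; The University of Sydney; Institute of Artificial Intelligence (TeleAI); China Telecom; Datawhale Org; Independent Researcher

AI summary HAMLET is a hierarchical, adaptive multi-agent framework for immersive live theatrical performance. It generates a narrative blueprint to guide improvised performance and equips each character with an adaptive reasoning module, so that characters can make autonomous decisions based on their personas, memories, and goals in complex dialogue scenarios. Characters can also interact in an embodied way by manipulating scene props, which increases realism and interactivity, and a dedicated evaluation model, HAMLETJudge, automatically assesses performance quality.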

Comments Accepted to the Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Creating an immersive and interactive theatrical experience is a long-term goal in the field of interactive narrative. The emergence of large language models (LLMs) provides a new path to achieve this goal. However, existing drama generation methods often produce LLMs that lack initiative and cannot interact with the physical scene, while typically requiring detailed input that diminishes the immersion of live performance. To address these challenges, we propose HAMLET, a hierarchical adaptive multi-agent framework focused on drama creation and real-time online performance. Given a simple topic, the framework initially generates a narrative blueprint to guide the subsequent improvisational performance. During online performance, each actor is equipped with an adaptive reasoning module that enables decision-making based on their personas, memories, and goals during complex group chat scenarios. Beyond dialogue, actor agents engage in embodied interactions by changing the state of scene props through actions such as opening a letter or picking up a weapon, which are broadcast to update the global environmental context. To objectively assess the quality of live embodied theatrics, we establish a comprehensive evaluation method and introduce HAMLETJudge, a specialized critic model for automated evaluation. Experimental results demonstrate that HAMLET excels in creating expressive, coherent, and physically interactive theatrical experiences in an autonomous manner.

2507.14785 2026-05-12 cs.LG cs.AI

Exploring the In-Context Learning Capabilities of LLMs for Money Laundering Detection in Financial Graphs

Erfan Pirmorad

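Affiliations The Vanguard Group, Inc.

AI summary This paper studies how large language models (LLMs) can be used for anti-money laundering detection over graph-structured data, exploring their in-context learning capabilities on financial graphs. The authors propose a lightweight pipeline that extracts local subgraphs around entities from a financial knowledge graph, converts them into structured text, and uses few-shot in-context learning to guide an LLM in assessing suspiciousness and generating explanations. Experiments show that LLMs can emulate analyst reasoning, identify risk signals, and give plausible explanations, illustrating the potential of LLM-based graph reasoning for anti-money laundering analysis.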

Comments Accepted at AI4FCF-ICDM 2025

Journal ref 2025 IEEE International Conference on Data Mining Workshops (ICDMW)

Abstract

The complexity and interconnectivity of entities involved in money laundering demand investigative reasoning over graph-structured data. This paper explores the use of large language models (LLMs) as reasoning engines over localized subgraphs extracted from a financial knowledge graph. We propose a lightweight pipeline that retrieves k-hop neighborhoods around entities of interest, serializes them into structured text, and prompts an LLM via few-shot in-context learning to assess suspiciousness and generate justifications. Using synthetic anti-money laundering (AML) scenarios that reflect common laundering behaviors, we show that LLMs can emulate analyst-style logic, highlight red flags, and provide coherent explanations. While this study is exploratory, it illustrates the potential of LLM-based graph reasoning in AML and lays groundwork for explainable, language-driven financial crime analytics.
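
The retrieval and serialization steps described in this abstract can be sketched as follows. This is a minimal illustration, assuming an edge list of `(source, target, attributes)` tuples, undirected traversal for the k-hop search, and a one-line-per-transaction text format; the function names, attribute keys, and text layout are hypothetical and not taken from the paper.

```python
from collections import deque


def k_hop_subgraph(edges, seed, k):
    """Return the edges whose two endpoints both lie within k hops of seed.

    Hops are counted with breadth-first search, ignoring edge direction.
    """
    adjacency = {}
    for src, dst, _attrs in edges:
        adjacency.setdefault(src, []).append(dst)
        adjacency.setdefault(dst, []).append(src)

    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    return [(s, d, a) for s, d, a in edges if s in depth and d in depth]


def serialize_subgraph(edges):
    """Turn each transaction edge into one line of structured text."""
    return "\n".join(
        f"{s} -> {d} | amount={a['amount']} | type={a['type']}"
        for s, d, a in edges
    )
```

The serialized text would then be placed in a prompt together with a few labeled example subgraphs; that prompting step depends on the chosen LLM client and is left out here.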

2507.07969 2026-05-12 cs.LG cs.AI cs.RO stat.ML

Reinforcement Learning with Action Chunking

Qiyang Li, Zhiyuan Zhou, Sergey Levine

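Affiliations UC Berkeley

AI summary This paper proposes Q-chunking, a method for improving reinforcement learning on long-horizon, sparse-reward tasks. By introducing action chunking, it lets the agent explore online more effectively under the guidance of offline data, and it uses unbiased n-step backups to make temporal-difference learning more stable and efficient. Experiments show that Q-chunking achieves strong offline performance and online sample efficiency on several long-horizon, sparse-reward manipulation tasks.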

Comments The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025); 29 pages, 17 figures

Abstract

We present Q-chunking, a simple yet effective recipe for improving reinforcement learning (RL) algorithms for long-horizon, sparse-reward tasks. Our recipe is designed for the offline-to-online RL setting, where the goal is to leverage an offline prior dataset to maximize the sample-efficiency of online learning. Effective exploration and sample-efficient learning remain central challenges in this setting, as it is not obvious how the offline data should be utilized to acquire a good exploratory policy. Our key insight is that action chunking, a technique popularized in imitation learning where sequences of future actions are predicted rather than a single action at each timestep, can be applied to temporal difference (TD)-based RL methods to mitigate the exploration challenge. Q-chunking adopts action chunking by directly running RL in a 'chunked' action space, enabling the agent to (1) leverage temporally consistent behaviors from offline data for more effective online exploration and (2) use unbiased $n$-step backups for more stable and efficient TD learning. Our experimental results demonstrate that Q-chunking exhibits strong offline performance and online sample efficiency, outperforming prior best offline-to-online methods on a range of long-horizon, sparse-reward manipulation tasks.
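
To make the n-step backup over an action chunk concrete, the sketch below computes a TD target for one chunk of $n$ actions: the discounted sum of the $n$ rewards collected while the chunk executes, plus the discounted value of the state reached afterwards. The function name and argument layout are assumptions for illustration, and the bootstrap value stands in for whatever critic estimate a full method would use.

```python
def chunked_td_target(rewards, bootstrap_value, gamma):
    """n-step TD target for a chunk of n = len(rewards) actions.

    target = sum_i gamma**i * r_i + gamma**n * bootstrap_value

    rewards: rewards observed while the chunk was executed, in order.
    bootstrap_value: critic estimate at the state after the chunk (assumed given).
    """
    target = 0.0
    for i, reward in enumerate(rewards):
        target += (gamma ** i) * reward
    return target + (gamma ** len(rewards)) * bootstrap_value
```

Because the critic scores the whole chunk that was actually executed, the rewards inside the target come from exactly the actions being evaluated, which is one way to read the abstract's claim that the n-step backup is unbiased.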

2507.03310 2026-05-12 cs.LG cs.AI

Causal Discovery for Irregularly Time Series with Consistency Guarantees

Weihong Li, Baohong Li, Anpeng Wu, Zhihan Li, Ming Ma, Keting Yin, Kun Kuang

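Affiliations Zhejiang University, Hangzhou, China

AI summary This paper studies causal discovery in irregularly sampled time series, a problem of particular importance in risk-sensitive domains such as finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. Existing methods lack an explicit mechanism for mutual consistency between imputation and structure learning, which leads to inaccurate causal graphs. The paper proposes ReTimeCausal, an EM-based framework that alternates between data imputation and causal structure learning to maintain structural consistency during optimization, and provides theoretical guarantees for structure recovery. Experiments show that it outperforms existing methods under irregular sampling and high missingness.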

Comments 12 pages, 2 figures

Abstract

This paper studies causal discovery in irregularly sampled time series, a key challenge in risk-sensitive domains like finance, healthcare, and climate science, where missing data and inconsistent sampling frequencies distort causal mechanisms. The main challenge comes from the interdependence between missing data imputation and causal structure recovery: errors in imputation and structure learning can reinforce each other, leading to an inaccurate causal graph. Existing methods either impute first and then discover, or jointly optimize both via neural representation learning, but lack explicit mechanisms to ensure mutual consistency of imputation and structure learning. We address this challenge with ReTimeCausal, an EM-based framework that alternates between imputation and structure learning, which encourages structural consistency throughout the optimization process. Our framework provides theoretical consistency guarantees for structure recovery and extends classical results to settings with irregular sampling and high missingness. ReTimeCausal combines kernel-based sparse regression and structural constraints in an alternating process that updates the completed data and the causal graph in turn. Experiments on synthetic and real-world datasets show that ReTimeCausal is more effective than existing methods under challenging irregular sampling and missing data.

2506.15787 2026-05-12 cs.AI cs.CL cs.LG

SLR: Automated Synthesis for Scalable Logical Reasoning

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting

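Affiliations TU Darmstadt; DFKI; Meta FAIR; CERTAIN, Germany; Lab1141; MPI-Inf

AI summary This paper proposes SLR, an end-to-end framework for systematically evaluating and training large language models (LLMs) through scalable logical reasoning. From a user's task, SLR automatically synthesizes the reasoning instruction, a validation program, and the latent ground-truth rule, with no human annotation and with controllable task difficulty. The authors use SLR to build SLR-Bench, a benchmark of 19,000 prompts. Experiments show that current LLMs do well at producing syntactically valid rules but still fall short on logical inference, while curriculum learning via SLR substantially improves model performance and generalizes well to several benchmarks.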

Abstract

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.

2506.12944 2026-05-12 cs.LG q-bio.TO

Unsupervised risk factor identification across cancer types and data modalities via explainable artificial intelligence

Maximilian Ferle, Jonas Ader, Thomas Wiemers, Nora Grieb, Adrian Lindenmeyer, Hans-Jonas Meyer, Thomas Neumuth, Markus Kreuz, Kristin Reiche, Maximilian Merz

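Affiliations Memorial Sloan Kettering Cancer Center

AI summary This study proposes an unsupervised learning method based on explainable artificial intelligence for identifying risk factors across cancer types and data modalities. The method optimizes survival heterogeneity across patient groups through a differentiable multivariate logrank statistic, without relying on proxy metrics, and is applicable to any neural network architecture and data type. It is validated in simulation experiments and on data from two cancers (laboratory parameters in multiple myeloma and CT images in non-small cell lung cancer), where it identifies patient subgroups with significantly different survival outcomes and reveals clinically relevant features consistent with known risk factors, providing a new interpretable tool for clinical risk stratification.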

Journal ref npj Digit. Med. 9, 363 (2026)

Abstract

Risk stratification is a key tool in clinical decision-making, yet current approaches often fail to translate sophisticated survival analysis into actionable clinical criteria. We present a novel method for unsupervised machine learning that directly optimizes for survival heterogeneity across patient clusters through a differentiable adaptation of the multivariate logrank statistic. Unlike most existing methods that rely on proxy metrics, our approach represents novel methodology for training any neural network architecture on any data modality to identify prognostically distinct patient groups. We thoroughly evaluate the method in simulation experiments and demonstrate its utility in practice by applying it to two distinct cancer types: analyzing laboratory parameters from multiple myeloma patients and computed tomography images from non-small cell lung cancer patients, identifying prognostically distinct patient subgroups with significantly different survival outcomes in both cases. Post-hoc explainability analyses uncover clinically meaningful features determining the group assignments, which align well with established risk factors and thus lend strong weight to the method's utility. This pan-cancer, model-agnostic approach represents a valuable advancement in clinical risk stratification, enabling the discovery of novel prognostic signatures across diverse data types while providing interpretable results that promise to complement treatment personalization and clinical decision-making in oncology and beyond.

2506.12090 2026-05-12 cs.CL

ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour

Jack Contro, Simrat Deol, Yulan He, Martim Brandão

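Affiliations King’s College London

AI summary This paper introduces ChatbotManip, a new dataset for studying manipulative behaviour in chatbots. It contains conversations between a chatbot and a simulated user in which the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards a goal, or be helpful. The study finds that large language models are highly prone to manipulation when explicitly instructed, and that even when asked only to be "persuasive", without explicit manipulation instructions, they often adopt controversial manipulative tactics. It also compares how well different models detect manipulation, providing a useful reference for AI safety research.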

Abstract

This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains simulated conversations between a chatbot and a (simulated) user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, small fine-tuned open-source models, such as BERT+BiLSTM, perform comparably to zero-shot classification with larger models like Gemini 2.5 Pro in detecting manipulation, but are not yet reliable for real-world oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.

2506.11578 2026-05-12 cs.AI

Efficient LLM Collaboration via Planning

Byeongchan Lee, Jonghoon Lee, Dongyoung Kim, Jaehyung Kim, Kyungjoon Park, Dongjun Lee, Jinwoo Shin

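Affiliations KAIST; Yonsei University

AI summary This paper studies how large language models (LLMs) and small models can collaborate efficiently to balance performance and cost. It proposes COPE, a collaboration framework in which a planner model generates an intermediate plan that guides an executor model; small and large models take turns as planner and executor in a multi-stage collaboration. Experiments show that COPE matches the performance of large models on multiple tasks while substantially reducing inference cost, confirming the value of planning for efficient inference.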

Abstract

Recently, large language models (LLMs) have demonstrated strong performance, ranging from simple to complex tasks. However, while large models achieve remarkable results across diverse tasks, they often incur substantial monetary inference cost, making frequent use impractical for many applications. In contrast, small models are often freely available and easy to deploy locally, but their performance on complex tasks remains limited. This trade-off raises a natural question: how can small and large models efficiently collaborate to combine their complementary strengths? To bridge this trade-off, we propose COPE, a test-time collaboration framework. A planner model first generates a plan that serves as a lightweight intermediate that guides a downstream executor model. Small and large models take turns acting as planner and executor, exchanging plans in a multi-stage cascade to collaboratively solve tasks. Through comprehensive experiments on benchmarks spanning mathematical reasoning, code generation, open-ended tasks, and agent tasks, we demonstrate that COPE achieves performance comparable to large proprietary models, while drastically reducing the inference API cost. These results highlight planning as an effective prior for cost-efficient inference.
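
The planner/executor cascade described in this abstract can be pictured with the sketch below. It is only a schematic under stated assumptions: models are plain callables, stages are ordered from cheapest to most expensive, each planner may see the previous stage's plan, and a caller-supplied `accept` predicate decides when to stop. The abstract does not specify the stopping rule or how plans are exchanged, so those parts in particular are hypothetical.

```python
def cascade(task, stages, accept):
    """Run planner/executor stages in order until an answer is accepted.

    stages: list of (planner, executor) pairs, cheapest first (assumed order).
      planner(task, previous_plan) -> plan
      executor(task, plan) -> answer
    accept: answer -> bool, a hypothetical stopping criterion.
    Returns the last answer produced and the plan that led to it.
    """
    plan, answer = None, None
    for planner, executor in stages:
        plan = planner(task, plan)
        answer = executor(task, plan)
        if accept(answer):
            break  # skip the remaining, more expensive stages
    return answer, plan
```

With stages such as (small planner, small executor), then (large planner, small executor), then (large planner, large executor), the expensive model is called only when the cheaper stages fail the acceptance check, which is where the cost saving would come from.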

2506.10622 2026-05-12 cs.CL cs.AI cs.LG

SDialog: A Python Toolkit for End-to-End Agent Building, User Simulation, Dialog Generation, and Evaluation

Sergio Burdisso, Séverin Baroudi, Yanis Labrak, David Grunert, Pawel Cyrta, Yiyang Chen, Srikanth Madikeri, Thomas Schaaf, Esaú Villatoro-Tello, Ahmed Hassoon, Ricard Marxer, Petr Motlicek

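Affiliations Idiap Research Institute; Université de Toulon; Aix Marseille Univ; LIS; Avignon University; Zenidoc; University of Zurich; Stenograf; Solventum; CNRS & ILLS; Johns Hopkins University

AI summary This paper introduces SDialog, an open-source Python toolkit that provides an end-to-end framework for dialog generation, evaluation, and interpretability analysis, for building and analyzing LLM-based conversational agents. SDialog supports multi-agent simulation, comprehensive evaluation methods, and mechanistic interpretability tools, integrates audio generation, and works with many mainstream LLM backends, helping researchers build, evaluate, and understand dialog systems more systematically.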

Comments Pre-print submitted to EACL System Demonstration (under review)

Abstract

We present SDialog, an MIT-licensed open-source Python toolkit that unifies dialog generation, evaluation and mechanistic interpretability into a single end-to-end framework for building and analyzing LLM-based conversational agents. Built around a standardized Dialog representation, SDialog provides: (1) persona-driven multi-agent simulation with composable orchestration for controlled, synthetic dialog generation, (2) comprehensive evaluation combining linguistic metrics, LLM-as-a-judge and functional correctness validators, (3) mechanistic interpretability tools for activation inspection and steering via feature ablation and induction, and (4) audio generation with full acoustic simulation including 3D room modeling and microphone effects. The toolkit integrates with all major LLM backends, enabling mixed-backend experiments under a unified API. By coupling generation, evaluation, and interpretability in a dialog-centric architecture, SDialog enables researchers to build, benchmark and understand conversational systems more systematically.

2506.09110 2026-05-12 cs.LG

CodeBrain: Bridging Decoupled Tokenizer and Multi-Scale Architecture for EEG Foundation Model

Jingying Ma, Feng Wu, Qika Lin, Yucheng Xing, Chenyu Liu, Ziyu Jia, Mengling Feng

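Affiliations Saw Swee Hock School of Public Health, National University of Singapore; Institute of Data Science, National University of Singapore; Guangzhou Research Translation and Innovation Institute, National University of Singapore; College of Computing and Data Science, Nanyang Technological University; Beijing Key Laboratory of Brainnetome and Brain-Computer Interface, Institute of Automation, Chinese Academy of Sciences; Brainnetome Center, Institute of Automation, Chinese Academy of Sciences

AI summary This paper proposes CodeBrain, a two-stage EEG foundation model that addresses the shortcomings of existing models in clinical interpretability, discriminative power, and capturing global dependencies. It first introduces the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, strengthening the discriminative power of the representations and offering interpretability in terms of neural events and spectral rhythms. It then uses the multi-scale EEGSSM architecture, which combines structured global convolution with sliding-window attention to capture long-range and local dependencies efficiently, reflecting the brain's small-world topology. CodeBrain generalizes well across multiple downstream tasks and datasets and has significant practical value.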

Comments Published as a conference paper at the International Conference on Learning Representations (ICLR 2026)

Abstract

Electroencephalography (EEG) provides real-time insights into brain activity and supports diverse applications in neuroscience. While EEG foundation models (EFMs) have emerged to address the scalability issues of task-specific models, current approaches still yield clinically uninterpretable and weakly discriminative representations, inefficiently capturing global dependencies and neglecting important local neural events. We present CodeBrain, a two-stage EFM designed to fill this gap. In the first stage, we introduce the TFDual-Tokenizer, which decouples heterogeneous temporal and frequency EEG signals into discrete tokens, quadratically expanding the representation space to enhance discriminative power and offering domain-specific representation-level interpretability by suggesting potential links to neural events and spectral rhythms. In the second stage, we propose the multi-scale EEGSSM architecture, which combines structured global convolution with sliding window attention to efficiently capture both sparse long-range and local dependencies, reflecting the brain's small-world topology. Pretrained on the largest public EEG corpus, CodeBrain achieves strong generalization across eight downstream tasks and ten datasets under distribution shifts, supported by comprehensive ablations, scaling-law analyses, and interpretability evaluations. The code and the pretrained weights are available at https://github.com/jingyingma01/CodeBrain.

2506.08136 2026-05-12 cs.CL

EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments

Zefang Liu, Yinzhu Quan

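Affiliations Capital One; Georgia Institute of Technology

AI summary This paper introduces EconWebArena, a benchmark for evaluating autonomous agents on complex economic tasks in realistic web environments. It comprises 360 curated tasks from 82 authoritative websites spanning macroeconomics, labor, finance, trade, and public policy, and requires agents to interpret web content, interact with interfaces, and extract precise, real-time data through multi-step workflows. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and web-grounded economic reasoning; experiments reveal that current models still face significant challenges in navigation, multimodal understanding, and task execution.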

Abstract

We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.

2506.04289 2026-05-12 cs.LG q-bio.NC

Relational reasoning and inductive bias in transformers and large language models

Jesse Geerts, Andrew Liu, Stephanie Chan, Claudia Clopath, Kimberly Stachenfeld

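Affiliations Imperial College London; Google; DeepMind; Columbia University

AI summary This study investigates the mechanisms by which Transformer-based models perform relational reasoning, in particular transitive inference. It compares in-weights learning (IWL) and in-context learning (ICL) and finds that IWL models achieve human-like transitive inference through a linear embedding, whereas ICL models show transitive inference only when the training data requires it. The study also shows that pre-training ICL models to acquire a linear representation brings their reasoning behaviour closer to IWL, and confirms in large language models that the training regime and representational structure critically affect the capacity for transitive inference.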

Comments 15 pages, 10 figures

Abstract

Transformer-based models have demonstrated remarkable reasoning abilities, but the mechanisms underlying relational reasoning remain poorly understood. We investigate how transformers perform transitive inference, a classic relational reasoning behavior from psychology which elicits inference about indirectly related items (e.g., if $A > B$ and $B > C$, then $A > C$). We compare in-weights learning (IWL) and in-context learning (ICL) behaviors and mechanisms on these tasks, and find profoundly different patterns of generalization. IWL models learn a linear embedding, which leads to transitive inference as well as other behavioral effects present in humans and animals. ICL models, in contrast, are capable of learning to generalize transitively, but only do so when it is necessitated by the training data, otherwise learning a match-and-copy strategy. Interestingly, pre-training ICL models on in-context linear regression tasks that provide them with a latent linear representation is sufficient to make the ICL behaviors and internal representations qualitatively and quantitatively more like IWL. In order to test whether the same inference patterns are present in large language models, we leverage a congruency paradigm which allows us to differentially probe IWL and ICL generalization patterns without access to their training data. We indeed see IWL reasoning leads to more transitive generalization than ICL. Moreover, we find that prompting the ICL models to use a linear mental map led to increased transitive inference over different geometric prompts. Together, these results reveal that both the training regime and the geometric structure of induced representations critically determine transformers' capacity for transitive inference.
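
Why a linear embedding yields transitive inference can be seen in a toy model that is much simpler than a transformer. The sketch below, an illustrative assumption rather than the paper's setup, assigns each item a single scalar score and trains those scores by gradient ascent on a logistic likelihood using only adjacent pairs (item $i$ outranks item $i+1$). Comparisons between items that were never paired in training then follow from the learned ordering.

```python
import math


def train_rank_embedding(n_items, steps=500, lr=0.5):
    """Learn one scalar score per item from adjacent-pair comparisons only."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for i in range(n_items - 1):
            # Training pair: item i outranks item i + 1.
            p = 1.0 / (1.0 + math.exp(-(scores[i] - scores[i + 1])))
            grads[i] += 1.0 - p      # raise the winner's score
            grads[i + 1] -= 1.0 - p  # lower the loser's score
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores


def outranks(scores, a, b):
    """Predict whether item a outranks item b from the scalar scores."""
    return scores[a] > scores[b]
```

Here the pair (1, 3) is never shown during training, yet `outranks(scores, 1, 3)` holds because both items sit on the same learned line. A match-and-copy strategy, by contrast, has nothing to retrieve for an unseen pair.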

2506.01404 2026-05-12 cs.LG cs.MA cs.SY eess.SY

Quantitative Error Feedback for Quantization Noise Reduction of Filtering over Graphs

Xue Xian Zheng, Weihang Liu, Xin Lou, Stefan Vlaski, Tareq Al-Naffouri

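Affiliations Division of Computer, Electrical and Mathematical Sciences and Engineering (CEMSE), King Abdullah University of Science and Technology (KAUST); School of Information Science and Technology, ShanghaiTech University; Department of Electrical and Electronic Engineering, Imperial College London

AI summary This paper proposes a novel error feedback framework for reducing quantization noise in distributed graph filtering, for settings where communication is restricted to quantized messages. The method draws on error spectrum shaping from state-space digital filters, connecting quantized filtering processes across domains, and achieves exact compensation by quantitatively feeding back the quantization noise. Theoretical analysis shows that the framework significantly reduces the effect of quantization noise, and closed-form solutions are given for the optimal error feedback coefficients. The framework also integrates seamlessly into efficient decentralized optimization frameworks, and experiments confirm its advantages in accuracy and robustness.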

Comments Accepted by IEEE TSP

Abstract

This paper introduces an innovative error feedback framework designed to mitigate quantization noise in distributed graph filtering, where communications are constrained to quantized messages. It draws on error spectrum shaping techniques from state-space digital filters, and thereby establishes connections between quantized filtering processes over different domains. In contrast to existing error compensation methods, our framework quantitatively feeds back the quantization noise for exact compensation. We examine the framework under three key scenarios: (i) deterministic graph filtering, (ii) graph filtering over random graphs, and (iii) graph filtering with random node-asynchronous updates. Rigorous theoretical analysis demonstrates that the proposed framework significantly reduces the effect of quantization noise, and we provide closed-form solutions for the optimal error feedback coefficients. Moreover, this quantitative error feedback mechanism can be seamlessly integrated into communication-efficient decentralized optimization frameworks, enabling lower error floors. Numerical experiments validate the theoretical results, consistently showing that our method outperforms conventional quantization strategies in terms of both accuracy and robustness.
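
The basic idea of feeding quantization error back can be shown on a scalar signal with the classic first-order scheme from digital filtering. The sketch below is not the paper's graph-filtering framework: it assumes a uniform quantizer, a single feedback coefficient `beta` (a hypothetical parameter name), and no graph structure, and it does not reproduce the paper's optimal coefficients.

```python
import math


def quantize(value, step):
    """Uniform quantizer: round to the nearest multiple of step."""
    return step * math.floor(value / step + 0.5)


def quantize_with_error_feedback(signal, step, beta=1.0):
    """Quantize a sequence, adding back beta times the previous error."""
    outputs = []
    error = 0.0
    for x in signal:
        u = x + beta * error  # compensate with the last quantization error
        q = quantize(u, step)
        error = u - q         # error carried to the next sample
        outputs.append(q)
    return outputs
```

With `beta = 1.0` the errors telescope: the sum of the outputs differs from the sum of the inputs only by the final error, which is at most half a step. Quantizing a constant 0.3 with step 1.0 without feedback gives all zeros, so the accumulated error grows with the sequence length, while the feedback version stays within 0.5 of the true running sum.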

2505.24859 2026-05-12 cs.LG cs.CL

Beyond Multiple Choice: Evaluating Steering Vectors for Summarization

Joschka Braun, Carsten Eickhoff, Seyed Ali Bahrainian

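Affiliations University of Tübingen

AI summary This study examines the use of steering vectors to control text properties (such as topic, sentiment, and readability) in summarization. Experiments on the SAMSum, NEWTS, and arXiv datasets show that steering vectors control the targeted properties effectively, but excessive steering strength leads to repetition and factual errors. Prompting alone preserves summary quality but gives weaker control, while combining steering vectors with prompting achieves the best balance of control and quality at moderate steering strengths.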

Comments Published in Findings of EACL 2026. Extended version of the ICML 2025 Workshop on Reliable and Responsible Foundation Models paper (v1, v2). 36 pages, 21 figures, 15 tables

Journal ref Findings of the Association for Computational Linguistics: EACL 2026, pages 3849-3884

Abstract

Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. While predominantly studied for multiple-choice and toy tasks, their effectiveness in free-form generation remains largely unexplored. Moving "Beyond Multiple Choice," we evaluate steering vectors for controlling topical focus, sentiment, toxicity, and readability in abstractive summaries across the SAMSum, NEWTS, and arXiv datasets. We find that steering effectively controls targeted properties, but high steering strengths consistently induce degenerate repetition and factual hallucinations. Prompting alone preserves summary quality but offers weaker control. Combining both methods yields the strongest control and the most favorable efficacy-quality trade-off at moderate steering strengths. Our work demonstrates that steering vectors face a critical control-quality trade-off in free-form generation, and that hybrid approaches offer the best balance in practice.
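
The core operation, adding a bias to activations at inference time, can be sketched in a few lines. The difference-of-means construction below is one common way to obtain a steering vector and is an assumption here, as are all function names; the abstract does not say how the vectors in this paper are learned.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Difference of mean activations (assumed construction, for illustration)."""
    mean_pos = mean_vector(pos_acts)
    mean_neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(mean_pos, mean_neg)]

def steer(hidden, vector, strength):
    """Add the scaled steering vector to a single hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy 2-dimensional activations for examples with and without the property.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [0.0, 1.0]]
v = steering_vector(pos, neg)
steered = steer([0.5, 0.5], v, strength=2.0)
```

The `strength` argument plays the role of the steering strength discussed in the abstract: larger values push the hidden state further along the vector.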

2505.22152 2026-05-12 cs.LG cs.SI

Uncertainty Estimation for Heterophilic Graphs Through the Lens of Information Theory

Dominik Fuchsgruber, Tom Wollschläger, Johannes Bordne, Stephan Günnemann

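Affiliations School of Computation, Information & Technology; Munich Data Science Institute, Technical University of Munich

AI summary This paper studies uncertainty estimation on heterophilic graphs (where nodes differ substantially from their neighbors in features), noting that conventional methods rely on a homophily assumption and perform poorly in heterophilic settings. The authors analyze message passing neural networks from an information-theoretic perspective and propose a data processing inequality suited to graph structure, revealing that on heterophilic graphs, information about a node's prediction target can increase with network depth. Building on this finding, they propose an uncertainty estimation method that uses all node representations simultaneously. It achieves state-of-the-art results on heterophilic graphs and also performs well on homophilic graphs, without explicitly relying on a homophily assumption.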

Abstract

While uncertainty estimation for graphs has recently gained traction, most methods rely on homophily and deteriorate in heterophilic settings. We address this by analyzing message passing neural networks from an information-theoretic perspective and developing a suitable analog of the data processing inequality to quantify information throughout the model's layers. In contrast to non-graph domains, information about the node-level prediction target can increase with model depth if a node's features are semantically different from its neighbors. Therefore, on heterophilic graphs, the latent embeddings of an MPNN each provide different information about the data distribution - different from homophilic settings. This reveals that considering all node representations simultaneously is a key design principle for epistemic uncertainty estimation on graphs beyond homophily. We empirically confirm this with a simple post-hoc density estimator on the joint node embedding space that provides state-of-the-art uncertainty on heterophilic graphs. At the same time, it matches prior work on homophilic graphs without explicitly exploiting homophily through post-processing.

2505.20654 2026-05-12 cs.CL cs.AI

Chinese Cyberbullying Detection: Dataset, Method, and Validation

Yi Zhu, Xin Zou, Xindong Wu

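Affiliations School of Information Engineering, Yangzhou University; Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education; School of Computer Science and Information Engineering, Hefei University of Technology

AI summary To address the limitation that existing cyberbullying detection datasets are organized by the polarity of speech, this study proposes a new incident-based annotation method and builds CHNCI, the first Chinese cyberbullying incident detection dataset, containing 220,676 comments across 91 incidents. The study uses explanation-generation-based detection methods to produce pseudo labels, validates them with human annotation, and proposes evaluation criteria for judging whether a case constitutes an incident. Experiments show that the dataset can serve as a benchmark for cyberbullying detection and incident prediction tasks.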

Abstract

Existing cyberbullying detection benchmarks are organized by the polarity of speech, such as "offensive" and "non-offensive", which essentially makes them hate speech detection benchmarks. However, in the real world, cyberbullying often attracts widespread social attention through incidents. To address this problem, we propose a novel annotation method to construct a cyberbullying dataset that is organized by incidents. The constructed CHNCI is the first Chinese cyberbullying incident detection dataset, and it consists of 220,676 comments in 91 incidents. Specifically, we first combine three cyberbullying detection methods based on explanation generation into an ensemble method to generate pseudo labels, and then have human annotators judge these labels. We then propose evaluation criteria for validating whether a case constitutes a cyberbullying incident. Experimental results demonstrate that the constructed dataset can serve as a benchmark for the tasks of cyberbullying detection and incident prediction. To the best of our knowledge, this is the first study of the Chinese cyberbullying incident detection task.

2505.20381 2026-05-12 cs.CV

ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao

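Affiliations National Key Laboratory of Science and Technology on Multi-spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

AI summary ReaMOT is a reasoning-based multi-object tracking task that aims to track targets specified by language instructions through logical reasoning, overcoming existing methods' reliance on explicit visual-textual matching. To this end, the authors introduce the ReaMOT Challenge benchmark, which contains a large number of language instructions and video sequences, and design the ReaTrack framework, which combines a large vision-language model with motion priors for more robust tracking. Experiments show that ReaTrack improves markedly on high-level reasoning tasks.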

Comments Code: https://github.com/chen-si-jia/ReaMOT

Abstract

Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms heavily rely on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level, requiring models to infer and track specific targets satisfying implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large-scale dataset. This dataset comprises 1,156 language instructions, 423,359 image-language pairs, and 869 distinct video sequences systematically categorized into six distinct evaluation scenarios, with over 75% of the instructions dedicated to High Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while direct application of a Large Vision-Language Model (LVLM) yields severe temporal inconsistencies, we propose ReaTrack. Driven by the insight to decouple high-level cognitive localization from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack establishes a new leading performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.

2505.20001 2026-05-12 cs.CV

NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification

Shihao Li, Huaibo Huang, Junxian Duan, Aihua Zheng, Jin Tang, Jixin Ma

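Affiliations State Key Laboratory of Opto-Electronic Information Acquisition and Protection Technology; Anhui Provincial Key Laboratory of Multimodal Cognitive Computation; School of Artificial Intelligence; Anhui University; State Key Laboratory of Multimodal Artificial Intelligence Systems; New Laboratory of Pattern Recognition; CASIA; University of Chinese Academy of Sciences; School of Computing and Mathematical Sciences; University of Greenwich

AI summary This paper studies multi-modal object re-identification, which aims to obtain complete identity features from heterogeneous modalities. To address existing methods' reliance on implicit feature fusion and their difficulty in modeling fine-grained recognition patterns, it proposes NEXT, a multi-grained mixture-of-experts framework based on text modulation. The method generates high-quality descriptive captions using attribute confidence and decouples recognition into semantic and structural branches that capture fine-grained appearance features and coarse-grained structure features, respectively. A multi-grained feature aggregation strategy then produces a more accurate identity representation. Experiments show that it significantly outperforms existing state-of-the-art methods on multiple datasets.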

Abstract

Multi-modal object Re-IDentification (ReID) aims to obtain complete identity features across heterogeneous modalities. However, most existing methods rely on implicit feature fusion modules, making it difficult to model fine-grained recognition patterns under various challenges in the real world. With powerful Multi-modal Large Language Models (MLLMs), object appearances can be effectively translated into descriptive captions. In this paper, we propose a reliable caption generation pipeline based on attribute confidence, which significantly reduces the unknown recognition rate of MLLMs and improves the quality of generated text. Additionally, to model diverse identity patterns, we propose a novel ReID framework, named NEXT, the Multi-grained Mixture of Experts via Text-Modulation for Multi-modal Object Re-Identification. Specifically, we decouple the recognition problem into semantic and structural branches to separately capture fine-grained appearance features and coarse-grained structure features. For semantic recognition, we first propose Text-Modulated Semantic Experts (TMSE), a module that randomly samples high-quality captions to modulate experts, capturing semantic features and mining inter-modality complementary cues. Second, to recognize structure features, we propose Context-Shared Structure Experts (CSSE), a module that focuses on the holistic object structure and maintains identity structural consistency via a soft routing mechanism. Finally, we propose Multi-Grained Features Aggregation (MGFA), which adopts a unified fusion strategy to effectively integrate multi-grained expert features into the final identity representations. Extensive experiments on two public person datasets and three vehicle datasets demonstrate the effectiveness of our method, showing that it significantly outperforms existing state-of-the-art methods.

2505.19519 2026-05-12 cs.CV

Preserve and Personalize: Personalized Text-to-Image Diffusion Models without Distributional Drift

Gihoon Kim, Hyungjin Park, Taesup Kim

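Affiliations Graduate School of Data Science; Seoul National University

AI summary This paper studies how to personalize text-to-image diffusion models without causing distributional drift. The authors note that existing methods tend to overfit the reference images and ignore the user's prompt during personalization, and they attribute this to a failure to enforce subject fidelity and text alignment simultaneously. They propose a Lipschitz-based regularization objective that constrains parameter updates and keeps the pretrained model's output distribution stable, preserving the original generative capabilities while accurately adapting to new concepts. Experiments show superior quantitative and qualitative performance across multiple diffusion model architectures.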

Comments Accepted at ICLR 2026

Abstract

Personalizing text-to-image diffusion models involves integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. However, this process often leads to overfitting, where the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, which are subject fidelity and text alignment, and the training objectives of existing methods that fail to enforce both objectives simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, resulting in distributional drift that undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This promotes consistency with the pretrained model's behavior while enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.

2505.06182 2026-05-12 cs.RO cs.LG

Apple: Toward General Active Perception via Reinforcement Learning

Tim Schneider, Cristiana de Farias, Roberto Calandra, Liming Chen, Jan Peters

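Affiliations Department of Computer Science, TU Darmstadt, Germany; LIRIS, CNRS UMR5205, École Centrale de Lyon, France; LASR Lab & CeTI, TU Dresden, Germany; DFKI, Hessian.AI, RIG, and Centre for Cognitive Science, TU Darmstadt, Germany

AI summary This paper proposes APPLE, a new framework that uses reinforcement learning to tackle general active perception. The framework combines a transformer-based perception module with a decision-making policy and trains them jointly under a unified optimization objective, so that a robot can actively gather information about its environment. Experiments show that APPLE performs well across a range of tasks, including tactile exploration, demonstrating its potential as a general active perception method.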

Comments 27 pages; 21 figures; accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

Abstract

Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment. For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) - a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems. We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics. Project page: https://timschneider42.github.io/apple

2504.19774 2026-05-12 cs.LG

If Concept Bottlenecks are the Question, are Foundation Models the Answer?

Nicola Debole, Pietro Barbiero, Francesco Giannini, Andrea Passerini, Stefano Teso, Emanuele Marconato

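Affiliations DISI, University of Trento; IBM Research Zurich; Faculty of Sciences, Scuola Normale Superiore; CIMeC, University of Trento

AI summary This paper asks whether foundation models can effectively replace expert annotations for concept learning in Concept Bottleneck Models (CBMs). It empirically analyzes weakly supervised, foundation-model-based approaches to concept learning and finds that their supervision can differ from expert annotations and that concept accuracy and concept quality are not strongly correlated. The study reveals both the potential and the limits of foundation models for improving CBM interpretability.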

Abstract

Concept Bottleneck Models (CBMs) are neural networks designed to conjoin high performance with ante-hoc interpretability. CBMs work by first mapping inputs (e.g., images) to high-level concepts (e.g., visible objects and their properties) and then using these to solve a downstream task (e.g., tagging or scoring an image) in an interpretable manner. Their performance and interpretability, however, hinge on the quality of the concepts they learn. The go-to strategy for ensuring good quality concepts is to leverage expert annotations, which are expensive to collect and seldom available in applications. Researchers have recently addressed this issue by introducing "VLM-CBM" architectures that replace manual annotations with weak supervision from foundation models. It is, however, unclear how doing so affects the quality of the learned concepts. To answer this question, we put state-of-the-art VLM-CBMs to the test, analyzing their learned concepts empirically using a selection of significant metrics. Our results show that, depending on the task, VLM supervision can differ appreciably from expert annotations, and that concept accuracy and quality are not strongly correlated. Our code is available at https://github.com/debryu/CQA.

2504.10766 2026-05-12 cs.LG cs.AI cs.CL

How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou

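Affiliations University of Maryland; Allen Institute for AI

AI summary As the fine-tuning of large language models shifts from instruction following to complex reasoning, how different data affect fine-tuning remains unclear. By analyzing the spectral properties of layer-wise gradients induced by instruction and reasoning data, this paper shows that data quality metrics such as IFD, InsTag, Difficulty, and Reward can be explained and unified by spectral properties derived from the singular value decomposition of the gradients. Higher-quality data typically have lower nuclear norms and higher effective ranks, and effective rank is more robust and has finer resolution in capturing subtle quality differences. In addition, models within the same family show similar gradient patterns, whereas different model families differ markedly, offering a new perspective for improving data exploration strategies.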

Comments ACL2026, camera-ready

Abstract

As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, offering novel insights into developing better data exploration strategies for post-training.
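
The two spectral quantities named above can be computed directly from a matrix's singular values. The sketch below assumes the entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the abstract does not state which definition the paper uses, and the function names are illustrative.

```python
import math

def nuclear_norm(singular_values):
    """Nuclear norm: the sum of the singular values."""
    return sum(singular_values)

def effective_rank(singular_values):
    """Entropy-based effective rank (assumed definition, see lead-in)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Normalize to a probability distribution, skipping zeros (0 * log 0 = 0).
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

flat_spectrum = [1.0, 1.0, 1.0, 1.0]    # energy spread evenly over 4 directions
peaked_spectrum = [5.0, 0.0, 0.0, 0.0]  # all energy in one direction
```

A flat spectrum gives an effective rank equal to the number of singular values, and a spectrum concentrated in one direction gives an effective rank of 1. In practice the singular values would come from an SVD of a layer's gradient matrix.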

2504.02343 2026-05-12 cs.LG

Toward General and Robust LLM-enhanced Text-attributed Graph Learning

Zihao Zhang, Xunkai Li, Rong-Hua Li, Zhenjun Li, Bing Zhou, Guoren Wang

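Affiliations Beijing Institute of Technology; Shenzhen Institute of Technology

AI summary With the rapid progress of large language models (LLMs) and the wide use of text-attributed graphs (TAGs) across domains, LLM-enhanced TAG learning has become an active research area. However, existing methods lack a unified framework for systematically organizing the complex interactions between LLMs and graph neural networks (GNNs), and they struggle with the text and edge sparsity of real-world TAGs. This paper proposes the unified framework UltraTAG and its instantiation UltraTAG-S, which mitigates sparsity through text propagation, text augmentation, and node selection. Experiments show that it significantly outperforms existing methods across settings and is more robust and general.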

Abstract

Recent advancements in Large Language Models (LLMs) and the proliferation of Text-Attributed Graphs (TAGs) across various domains have positioned LLM-enhanced TAG learning as a critical research area. By utilizing rich graph descriptions, this paradigm leverages LLMs to generate high-quality embeddings, thereby enhancing the representational capacity of Graph Neural Networks (GNNs). However, the field faces significant challenges: (1) the absence of a unified framework to systematize the diverse optimization perspectives arising from the complex interactions between LLMs and GNNs, and (2) the lack of a robust method capable of handling real-world TAGs, which often suffer from text and edge sparsity, leading to suboptimal performance. To address these challenges, we propose UltraTAG, a unified pipeline for LLM-enhanced TAG learning. UltraTAG provides a unified, comprehensive, and domain-adaptive framework that not only organizes existing methodologies but also paves the way for future advancements in the field. Building on this framework, we propose UltraTAG-S, a robust instantiation of UltraTAG designed to tackle the inherent sparsity issues in real-world TAGs. UltraTAG-S employs LLM-based text propagation and text augmentation to mitigate text sparsity, while leveraging LLM-augmented node selection techniques based on PageRank and edge reconfiguration strategies to address edge sparsity. Our extensive experiments demonstrate that UltraTAG-S significantly outperforms existing baselines, achieving improvements of 2.12% and 17.47% in ideal and sparse settings, respectively. Moreover, as the data sparsity ratio increases, the performance improvement of UltraTAG-S also rises, which underscores the effectiveness and robustness of UltraTAG-S.

2503.05066 2026-05-12 cs.LG cs.AI cs.CL

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Shwai He, Weilin Cai, Jiayi Huang, Ang Li

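Affiliations University of Maryland, College Park; The Hong Kong University of Science and Technology (Guangzhou)

AI summary In Mixture of Experts (MoE) architectures, the "Straggler Effect" caused by imbalanced expert load can significantly hurt inference efficiency. This paper proposes a capacity-aware inference method that introduces Capacity-Aware Token Drop and Capacity-Aware Expanded Drop, which mitigate expert load imbalance and improve expert utilization and inference speed. Experiments show notable performance and efficiency gains on a range of language and multimodal MoE models.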

Comments ICLR 2026

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation to balance performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where underloaded experts complete computations early but must wait for overloaded experts, leading to global delays. We define this phenomenon as the Straggler Effect, as the most burdened experts dictate the overall inference latency. To address this, we first propose Capacity-Aware Token Drop, which enforces expert capacity limits by discarding excess tokens from overloaded experts, effectively reducing load imbalance with minimal performance impact (e.g., 30% speedup with only 0.9% degradation on OLMoE). Next, given the presence of low-load experts remaining well below the capacity threshold, we introduce Capacity-Aware Expanded Drop, which allows tokens to include additional local experts in their candidate set before enforcing strict local capacity constraints, thereby improving load balance and enhancing the utilization of underused experts. Extensive experiments on both language and multimodal MoE models demonstrate the effectiveness of our approach, yielding substantial gains in expert utilization, model performance, and inference efficiency, e.g., applying Expanded Drop to Mixtral-8×7B-Instruct yields a 0.2% average performance improvement and a 1.85× inference speedup. The code is released at: https://github.com/CASE-Lab-UMD/Capacity-Aware-MoE.
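
As a rough illustration of enforcing a per-expert capacity limit, the sketch below caps each expert's load and drops the excess routing entries. Two details are assumptions made for this example and are not taken from the abstract: the capacity formula (`capacity_factor` times the average load) and the rule of dropping the entries with the lowest gating scores. The function name and data layout are hypothetical as well.

```python
import math

def token_drop(assignments, num_experts, capacity_factor=1.0):
    """Cap each expert's load and drop the excess routing entries.

    assignments: list of (token_id, expert_id, score) tuples.
    Returns (kept, dropped). The drop rule (lowest score first) and the
    capacity formula are assumptions for illustration.
    """
    capacity = math.ceil(capacity_factor * len(assignments) / num_experts)
    per_expert = {e: [] for e in range(num_experts)}
    for entry in assignments:
        per_expert[entry[1]].append(entry)
    kept, dropped = [], []
    for entries in per_expert.values():
        entries.sort(key=lambda t: t[2], reverse=True)  # best scores first
        kept.extend(entries[:capacity])
        dropped.extend(entries[capacity:])
    return kept, dropped

# Expert 0 is overloaded (3 entries) while expert 1 has only 1.
routing = [(0, 0, 0.9), (1, 0, 0.8), (2, 0, 0.4), (3, 1, 0.7)]
kept, dropped = token_drop(routing, num_experts=2, capacity_factor=1.0)
```

Here the capacity is 2 per expert, so the overloaded expert sheds its lowest-scoring entry and no expert holds more work than the cap, which is what bounds the wait for the slowest expert.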

2502.10760 2026-05-12 cs.CL cs.LG stat.ML

Why is prompting hard? Understanding prompts on binary sequence predictors

Li Kevin Wenliang, Anian Ruoss, Jordi Grau-Moya, Marcus Hutter, Tim Genewein

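Affiliations Google DeepMind

AI summary This paper explores why designing effective prompts is hard, using binary sequence predictors, and views prompting as finding the best conditioning sequence on a near-optimal sequence predictor. Through numerous controlled experiments, it finds that unintuitive optimal prompts are better understood in light of the pretraining distribution, and that optimal prompts for practical neural predictors are hard to identify reliably even with exhaustive search. It also notes that popular prompting methods, such as using demonstrations from the target task, can be suboptimal, and that optimal prompts on frontier models show patterns similar to the binary examples and prior findings.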

Journal ref Artificial Intelligence and Statistics 2026

Abstract

Frontier models can be prompted or conditioned to do many tasks, but finding good prompts is not always easy, nor is understanding some performant prompts. We view prompting as finding the best conditioning sequence on a near-optimal sequence predictor. In numerous well-controlled experiments, we show that unintuitive optimal conditioning sequences can be better understood given the pretraining distribution, which is not usually available. Even using exhaustive search, reliably identifying optimal prompts for practical neural predictors can be surprisingly difficult. Popular prompting methods, such as using demonstrations from the targeted task, can be surprisingly suboptimal. Using the same empirical framework, we analyze optimal prompts on frontier models, revealing patterns similar to the binary examples and previous findings. Taken together, this work takes an initial step towards understanding optimal prompts, from a statistical and empirical perspective that complements research on frontier models.

2502.03749 2026-05-12 cs.LG math.OC

PINS: Proximal Iterations with Sparse Newton and Sinkhorn for Optimal Transport

Di Wu, Ling Liang, Haizhao Yang

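Affiliations Department of Mathematics; University of Maryland; University of Tennessee; Department of Computer Science

AI summary Optimal transport (OT) is widely used in machine learning, but computing high-accuracy solutions for large instances remains costly. This paper proposes PINS, a two-loop solver that combines an entropic proximal-point method with sparse Newton and Sinkhorn steps, overcoming the slow convergence and entropic-bias plateau of conventional Sinkhorn at small regularization parameters. Experiments show that PINS outperforms existing methods in relative cost error and runtime and is more memory-efficient on large-scale datasets.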

Abstract

Optimal transport (OT) is a widely used tool in machine learning, but computing high-accuracy solutions for large instances remains costly. Entropic regularization and the Sinkhorn algorithm improve scalability; however, when the regularization parameter is small, Sinkhorn convergence slows, and the iterates approach an entropic solution that remains separated from the true OT plan by an entropic-bias plateau. We introduce PINS (Proximal Iterations with sparse Newton and Sinkhorn), a two-loop solver designed to move beyond this plateau. The outer loop applies an entropic proximal-point method, solving the original OT problem through a sequence of entropic subproblems with shifted cost matrices. Each inner subproblem is then solved by a Sinkhorn warm-up followed by sparse-Newton refinement. We prove that PINS converges globally to an optimal solution of the unregularized OT problem and that the inner Hessian admits a sparsification at every outer iteration with a structure independent of the cost matrix. On synthetic and augmented-MNIST instances, PINS achieves much lower relative cost errors than Sinkhorn-type baselines, which stall at the entropic-bias plateau, and is 5-73× faster than Sinkhorn with the same outer loop at matched accuracy. On large-scale DOTmark instances, a streaming implementation reduces peak memory by 24-54% compared with the network-simplex linear programming (LP) solver and remains feasible under per-process memory budgets for which the LP solver fails.
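
For readers unfamiliar with the baseline, the following is a minimal sketch of the plain Sinkhorn scaling iteration for entropic OT. It is not PINS: there is no proximal outer loop and no sparse-Newton refinement, and the function name, the fixed iteration count, and the toy problem are assumptions made for illustration.

```python
import math

def sinkhorn(cost, a, b, eps, iters=500):
    """Entropic OT plan via Sinkhorn scaling.

    cost: n x m list of lists; a, b: marginals that each sum to 1;
    eps: entropic regularization parameter.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K_ij = exp(-C_ij / eps).
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, a=[0.5, 0.5], b=[0.5, 0.5], eps=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(2)) for j in range(2)]
```

The resulting plan matches both marginals and puts most of its mass on the zero-cost diagonal, but it keeps some mass off the diagonal. That residual spread is the entropic bias described in the abstract, and it shrinks only as `eps` is reduced.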

2411.18111 2026-05-12 cs.CV

When Large Vision-Language Models Meet Person Re-Identification

Qizao Wang, Bin Li, Xiangyang Xue

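Affiliations School of Computer Science, Fudan University, Shanghai, China

AI summary This paper studies how to apply large vision-language models (LVLMs) to person re-identification (ReID). Conventional ReID relies on extracting discriminative identity features, whereas LVLMs excel at cross-modal understanding and generation. The authors propose the LVLM-ReID framework, which uses instructions to guide the LVLM to generate a semantic token capturing key appearance semantics of the person, strengthens the link between semantic and visual features with a Semantic-Guided Interaction module, and uses the reinforced semantic token as the identity representation. The method achieves competitive performance on multiple benchmarks without additional image-text annotations, showing the potential of LVLM-generated semantics for improving ReID.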

Comments Accepted by ICASSP 2026

Abstract

Large Vision-Language Models (LVLMs) that incorporate visual models and large language models have achieved impressive results across cross-modal understanding and reasoning tasks. In recent years, person re-identification (ReID) has also started to explore cross-modal semantics to improve the accuracy of identity recognition. However, effectively utilizing LVLMs for ReID remains an open challenge. While LVLMs operate under a generative paradigm by predicting the next output word, ReID requires the extraction of discriminative identity features to match pedestrians across cameras. In this paper, we propose LVLM-ReID, a novel framework that harnesses the strengths of LVLMs to promote ReID. Specifically, we employ instructions to guide the LVLM in generating one semantic token that encapsulates key appearance semantics from the person image. This token is further refined through our Semantic-Guided Interaction (SGI) module, establishing a reciprocal interaction between the semantic token and visual tokens. Ultimately, the reinforced semantic token serves as the representation of pedestrian identity. Our framework integrates the semantic understanding and generation capabilities of LVLM into end-to-end ReID training, allowing LVLM to capture rich semantic cues during both training and inference. LVLM-ReID achieves competitive results on multiple benchmarks without additional image-text annotations, demonstrating the potential of LVLM-generated semantics to advance person ReID.