arXivDaily: daily arXiv digest, updated Monday through Friday
2604.24119 2026-06-11 cs.CV Version update

TopoHR: Hierarchical Centerline Representation for Cyclic Topology Reasoning in Driving Scenes with Point-to-Instance Relations

Yifeng Bai, Zhirong Chen, Bo Song, Erkang Cheng, Haibin Ling

Affiliations: Institute of Intelligent Machines, HFIPS, Chinese Academy of Sciences; University of Science and Technology of China; NullMax; Westlake University

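AI summary: Proposes TopoHR, a framework that uses a hierarchical centerline representation together with point-to-instance and instance-to-instance relations in a unified architecture to establish cyclic interaction between centerline detection and topology reasoning, yielding significant gains on OpenLane-V2.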

Comments Accepted at CVPR 2026 (camera ready version)

Abstract

Topology reasoning is crucial for autonomous driving. Current methods primarily focus on instance-level learning for centerline detection, followed by a sequential module for topology reasoning that relies on simplified MLP layers. Moreover, they often neglect the importance of point-to-instance (P2I) relationships in topology reasoning. To address these limitations, we present TopoHR (Topological Hierarchical Representation), a novel end-to-end framework that establishes cyclic interaction between centerline detection and topology reasoning, allowing them to iteratively enhance each other. Specifically, we introduce a hierarchical centerline representation including point queries, instance queries, and semantic representations. These multi-level features are seamlessly integrated and fused within a hierarchical centerline decoder. Furthermore, we design a hierarchical topology reasoning module that captures both fine-grained P2I relationships and global instance-to-instance (I2I) connections within a unified architecture. With these novel components, TopoHR ensures accurate and robust topology reasoning. On the OpenLane-V2 benchmark, TopoHR refreshes state-of-the-art performance with significant improvements. Notably, compared with previous best results, TopoHR achieves +3.8 in $\mathrm{DET}_{\text{l}}$, +5.4 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset\_A}$ and +11.0 in $\mathrm{DET}_{\text{l}}$, +7.9 in $\mathrm{TOP}_{\text{ll}}$ on $\text{subset\_B}$, validating the effectiveness of the proposed components. The code will be shared publicly at https://github.com/Yifeng-Bai/TopoHR.git.

2510.18289 2026-06-11 cs.CL cs.CY cs.MA Version update

Food4All: An Agentic Framework and Benchmark for Food Resource Navigation with Adaptive User Understanding

Yiyang Li, Weixiang Sun, Tianyi Ma, Kaiwen Shi, Zheyuan Zhang, Yanfang Ye

Affiliations: University of Notre Dame

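AI summary: Proposes the Food4All framework, which combines a food-specific search tool with 300 multi-turn evaluation tasks, evaluates six large language models on 686 Indiana food resources, and diagnoses their shortcomings in handling constraints and non-ideal user interactions.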

Comments We have further refined the benchmark construction and experimental presentation to improve clarity and consistency. The revised version includes updated task design, food-resource data, and evaluation details to better align the benchmark with the intended food resource referral setting. These changes provide a more precise presentation of the experimental findings

Abstract

Food assistance referral requires conversational agents to translate underspecified, often noisy help-seeking dialogues into locally valid resource recommendations. We present Food4All, an agentic food-resource referral framework and benchmark grounded in 686 structured Indiana food resources. Food4All couples a food-specific search tool with 300 multi-turn evaluation tasks spanning single food needs, composite cases with access or document constraints, and five non-ideal user interaction traits: unreasonable demands, rambling responses, impatience, incomplete answers, and inconsistent information. We evaluate six Large Language Models (LLMs) on requirement grounding, resource retrieval, final referral correctness, and interaction efficiency. Although the strongest model achieves 96.33% referral accuracy, our diagnostics reveal persistent failures in grounding schedule, eligibility, intake, and document constraints, as well as failures to preserve valid retrieved resources in the final recommendation. Trait-level analysis further shows that different non-ideal behaviors stress different parts of the referral pipeline. Food4All provides a controlled testbed for studying tool-calling agents in constraint-sensitive food assistance referral under realistic user interaction challenges.

2604.22167 2026-06-11 cs.LG cs.AI Version update

Estimating Tail Risks in Language Model Output Distributions

Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

Affiliations: Columbia University; Department of Computer Science, New York University; Center for Data Science, New York University

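AI summary: Proposes an importance-sampling method that efficiently estimates the tail probability of harmful language model outputs by creating unsafe versions of the target model, matching Monte Carlo estimates with 10-20x fewer samples and revealing model sensitivity to input perturbations.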

Comments Accepted to ICML 2026

Abstract

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at https://github.com/rangell/LMTailRisk
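
The importance-sampling idea in this abstract can be illustrated with a toy sketch. The three-outcome distributions below are invented for illustration and do not come from the paper: `target` stands in for the model under evaluation, and `proposal` for a hypothetical "unsafe version" that makes the harmful outcome more probable. Sampling from the proposal and reweighting each harmful sample by the ratio target/proposal gives an unbiased estimate of the harmful-output probability under the target.

```python
import random


def importance_sampling_estimate(p, q, is_harmful, n_samples, rng):
    """Estimate P_p(harmful) by sampling from q and reweighting by p/q.

    p and q map each possible output to its probability under the target
    and proposal distributions; q must be nonzero wherever p is nonzero.
    """
    outputs = list(q)
    weights = [q[y] for y in outputs]
    total = 0.0
    for _ in range(n_samples):
        y = rng.choices(outputs, weights=weights, k=1)[0]
        if is_harmful(y):
            total += p[y] / q[y]  # importance weight of this sample
    return total / n_samples


# Toy target: the harmful outcome has true probability 1e-4 (assumed value).
target = {"refuse": 0.9, "benign_answer": 0.0999, "harmful_answer": 0.0001}
# Hypothetical "unsafe" proposal that produces the harmful outcome often.
proposal = {"refuse": 0.2, "benign_answer": 0.3, "harmful_answer": 0.5}

rng = random.Random(0)
estimate = importance_sampling_estimate(
    target, proposal, lambda y: y == "harmful_answer", 500, rng
)
print(f"estimated P(harmful) = {estimate:.2e}")
```

With naive sampling from `target`, 500 draws would usually contain no harmful outcome at all, so the naive estimate would be zero; the proposal concentrates samples where the rare event lives. For a real language model, the probabilities would be sequence likelihoods under the two models rather than entries of a small table.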

2511.14427 2026-06-11 cs.RO cs.LG Version update

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

Affiliations: Interactive Robot Perception & Learning (PEARL) Lab, TU Darmstadt, Germany; Hessian.AI; Robotics Institute Germany (RIG)

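AI summary: Proposes the MSDP framework, which learns multisensory representations through masked autoencoding and cross-modal prediction and uses an asymmetric architecture (the critic extracts dynamic features via cross-attention, while the actor uses a stable pooled representation) to accelerate policy learning, showing robustness and efficiency on simulated and real robot tasks.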

Comments 8 pages, 11 figures

Journal ref
IEEE Robotics and Automation Letters, 2026, Vol. 11, No. 6, pp. 6799-6806

Abstract

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, Reinforcement Learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise, and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: https://msdp-pearl.github.io/

2604.20348 2026-06-11 cs.RO cs.AI cs.MA Version update

Bimanual Robot Manipulation via Multi-Agent In-Context Learning

Alessio Palma, Indro Spinelli, Vignesh Prasad, Luca Scofano, Yufeng Jin, Georgia Chalvatzaki, Fabio Galasso

Affiliations: Sapienza University of Rome; TU Darmstadt; Hessian.AI

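AI summary: Proposes the BiCICLe framework, which models bimanual manipulation as a multi-agent leader-follower problem and decouples the action space so that standard LLMs can learn from few-shot examples, reaching a 70.5% average success rate on the TWIN benchmark and outperforming training-free baselines.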

Abstract

Language Models (LLMs) have emerged as powerful reasoning engines for embodied control. In particular, In-Context Learning (ICL) enables off-the-shelf, text-only LLMs to predict robot actions without any task-specific training while preserving their generalization capabilities. Applying ICL to bimanual manipulation remains challenging as the high-dimensional joint action space and tight inter-arm coordination constraints rapidly overwhelm standard context windows. To address this, we introduce BiCICLe (Bimanual Coordinated In-Context Learning), the first framework that enables standard LLMs to perform few-shot bimanual manipulation without fine-tuning. BiCICLe frames bimanual control as a multi-agent leader-follower problem, decoupling the action space into sequential, conditioned single-arm predictions. Evaluated on 13 tasks from the TWIN benchmark, BiCICLe achieves 70.5% average success rate, outperforming the best training-free baseline by 6.1 percentage points and surpassing most supervised methods. We also demonstrate superior real-world performance on 3 tasks without hardware-specific retraining.

2604.16287 2026-06-11 cs.SD Version update

NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Maryam Ibrahim Mukhtar, Daud Abolade, Joel Okepefi, Johnson Sewedo, David Ifeoluwa Adelani

Affiliations: Mila - Quebec AI Institute; McGill University; Google DeepMind; Hausa NLP; Imperial College; University of Pretoria; Masakhane NLP; Naijá Wikipedia Community; Canada CIFAR AI Chair

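AI summary: To address the scarcity of speech translation data for low-resource African languages, builds NaijaS2ST, a parallel speech dataset covering Igbo, Hausa, Yorùbá, and Nigerian Pidgin (about 50 hours per language), and systematically evaluates cascaded, end-to-end, and audio LLM approaches, finding that few-shot audio LLMs perform better on speech-to-text translation, while the cascaded and audio LLM paradigms perform comparably on speech-to-speech translation.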

Comments Preprint

Abstract

Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.

2604.13733 2026-06-11 cs.LG cs.AI cs.RO Version update

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

Affiliations: University of Applied Science and Arts of Southern Switzerland, Department of Innovative Technologies; Università della Svizzera Italiana, Faculty of Informatics, Lugano, Switzerland

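AI summary: Proposes VLAJS, which guides PPO exploration with sparse high-level action suggestions from a VLA model, combined with a directional action-consistency regularization, to improve the sample efficiency of reinforcement learning on long-horizon manipulation tasks, validated in simulation and on a real robot.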

Comments ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning

Abstract

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.
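
The abstract does not give the exact form of the directional action-consistency regularization, so the following is only one plausible reading, sketched under stated assumptions: the penalty is taken to be one minus the cosine similarity between the agent's action and the VLA suggestion, the guidance weight is annealed linearly to zero, and steps without a VLA query add no penalty. All function names and the linear schedule are hypothetical.

```python
import math


def directional_penalty(agent_action, vla_action, eps=1e-8):
    """Assumed penalty: 1 - cosine similarity (0 when directions agree)."""
    dot = sum(a * b for a, b in zip(agent_action, vla_action))
    norm_agent = math.sqrt(sum(a * a for a in agent_action))
    norm_vla = math.sqrt(sum(b * b for b in vla_action))
    cosine = dot / (max(norm_agent, eps) * max(norm_vla, eps))
    return 1.0 - cosine


def guidance_weight(step, anneal_steps, initial_weight=1.0):
    """Assumed linear annealing of the guidance weight down to zero."""
    return initial_weight * max(0.0, 1.0 - step / anneal_steps)


def regularized_loss(ppo_loss, agent_action, vla_action, step, anneal_steps):
    """Add the annealed penalty only on steps where the VLA was queried."""
    if vla_action is None:  # sparse guidance: no suggestion at this step
        return ppo_loss
    weight = guidance_weight(step, anneal_steps)
    return ppo_loss + weight * directional_penalty(agent_action, vla_action)


# Orthogonal actions: full penalty early in training, none after annealing.
loss_early = regularized_loss(0.3, [1.0, 0.0], [0.0, 2.0], step=0, anneal_steps=1000)
loss_late = regularized_loss(0.3, [1.0, 0.0], [0.0, 2.0], step=1000, anneal_steps=1000)
print(loss_early, loss_late)
```

Because the penalty depends only on direction, the agent remains free to choose its own action magnitude, which fits the abstract's description of soft alignment without strict imitation.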

2604.13326 2026-06-11 cs.CV Version update

Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

Akshit Achara, Yovin Yahathugoda, Nick Byrne, Michela Antonelli, Esther Puyol Anton, Alexander Hammers, Andrew P. King

Affiliations: School of Biomedical Engineering & Imaging Sciences, King’s College London, UK

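AI summary: Studies label flips in semantic segmentation caused by spurious correlations between non-causal features and labels, and proposes a Flip diagnostic metric and an entropy-based, label-free flip-risk score.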

Comments Author name correction in this version

Abstract

The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free `flip-risk' score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.
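
The decomposition of foreground errors described in this abstract can be sketched in a few lines. This is an illustrative reading, not the authors' implementation (their code is at the repository linked above): label 0 is assumed to be background, and each count is normalized by the number of ground-truth foreground pixels, which is an assumption about how the Flip diagnostic is reported.

```python
def foreground_error_breakdown(gt, pred, background=0):
    """Split ground-truth foreground pixels into three fractions.

    correct: predicted with the right foreground class
    flipped: predicted as foreground, but with the wrong class
    missed:  predicted as background
    """
    counts = {"correct": 0, "flipped": 0, "missed": 0}
    for gt_row, pred_row in zip(gt, pred):
        for g, p in zip(gt_row, pred_row):
            if g == background:
                continue  # only ground-truth foreground pixels are scored
            if p == g:
                counts["correct"] += 1
            elif p == background:
                counts["missed"] += 1
            else:
                counts["flipped"] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy 3x3 label maps: class 1 is the true object, class 2 a plausible swap.
gt = [[0, 1, 1],
      [0, 1, 1],
      [0, 0, 0]]
pred = [[0, 2, 2],
        [0, 1, 0],
        [0, 0, 0]]

breakdown = foreground_error_breakdown(gt, pred)
print(breakdown)
```

In this toy case three of the four object pixels are still predicted as foreground, so an overlap measure computed on a foreground-versus-background mask would look reasonable even though half of the object carries the wrong class, which is the failure mode the abstract highlights.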

2604.10242 2026-06-11 cs.CV Version update

MedVeriSeg: Teaching LISA-Like Medical Segmentation Models to Verify Query Validity Without Extra Training

Qinyue Tong, Xiaozhen Wang, Ziqian Lu, Jun Liu, Yunlong Yu, Zheming Lu

Affiliations: School of Aeronautics and Astronautics, Zhejiang University; Southern Medical University; School of Computer Science and Technology (School of Artificial Intelligence), Zhejiang Sci-Tech University; College of Information Science and Electronic Engineering, Zhejiang University

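AI summary: Proposes MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. A Similarity Response Quality Scoring Module and a Lightweight Routed Multi-Agent Verification Module improve verification robustness, and the MedVeriSeg-Bench benchmark is constructed for evaluation; the approach effectively reduces hallucinated segmentation.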

Comments 13 pages, 9 figures

Abstract

Despite recent progress in text-prompt-based medical image segmentation, existing LISA-like MLLM-based methods typically generate masks regardless of whether the target specified in the query is present, leading to hallucinated segmentation. In this work, we propose MedVeriSeg, a training-free query verification framework that enables LISA-like medical segmentation models to reject false segmentation queries. MedVeriSeg first quantifies the response quality between the [SEG] token and image features through a Similarity Response Quality Scoring Module. To further improve robustness, it employs a Lightweight Routed Multi-Agent Verification Module, which fuses quantitative score evidence with qualitative agent evidence to comprehensively verify the validity of the query. To support systematic evaluation, we construct MedVeriSeg-Bench, a benchmark designed for query verification in medical image segmentation. Experimental results demonstrate that MedVeriSeg effectively identifies false segmentation queries and reduces hallucinated segmentation, while maintaining a high acceptance rate for valid queries, thereby largely preserving the segmentation utility of LISA-like medical segmentation models.

2601.04884 2026-06-11 cs.AI Version update

Precomputing Multi-Agent Path Replanning Using Temporal Flexibility

Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt

Affiliations: Department of Computer Science, University of Waterloo

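AI summary: Addresses conflicts caused by a single delayed agent during multi-agent plan execution with the FlexSIPP algorithm, which precomputes all feasible plans for the delayed agent and exploits the temporal flexibility of other agents to avoid cascading delays, achieving efficient replanning on the Dutch railway network and the MovingAI benchmark.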

Comments Accepted at SoCS'26

Abstract

Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not yield an efficient plan, and sometimes cannot even yield a feasible one. On the other hand, replanning other agents may lead to a cascade of changes and delays, and it is computationally expensive. We show how to efficiently replan a single delayed agent by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay that the agent can take without changing the order with agents other than the initially delayed agent, or further delaying other agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent and returns the changes to the other agents within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network and in the MovingAI MAPF benchmark set. Our experiments show that FlexSIPP provides effective solutions relevant to real-world adjustments, and within a reasonable timeframe.

2604.06961 2026-06-11 cs.CV Version update

Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Pablo Parte, Roberto Valle, José M. Buenaposada, Luis Baumela

Affiliations: Departamento de Inteligencia Artificial, Universidad Politécnica de Madrid; Departamento de Informática y Estadística, Universidad Rey Juan Carlos; ELLIS Unit Madrid

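AI summary: Systematically audits age, gender, and race bias in facial landmark detection, using a controlled statistical methodology to separate demographic effects from confounding visual factors; finds that confounders such as head pose and resolution have the larger impact, but that a significant age bias remains.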

Journal ref
35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

Abstract

Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender, and race biases. To this end, we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Our analysis demonstrates that visual confounders, particularly head pose and face resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, performance disparities across gender and race vanish. However, we identify a statistically significant age-related bias, with higher localization errors for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

2509.09794 2026-06-11 cs.AI cs.LG Version update

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

Affiliations: Lafayette University; Georgia State University

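AI summary: Proposes a multimodal generative AI framework that integrates image, tabular, and simulation components to generate synthetic residential building datasets from public records and images, addressing the scarcity of building parameter data.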

Comments 37 pages; 2 appendices; 6 figures; 2 tables. Code available at https://github.com/Lafayette-EshbaughSilveyra-Group/synthetic-homes

Abstract

Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting data-driven analysis across building and urban energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves greater visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset, finding that our synthetic data overlaps more than 95% for three of the four selected variables. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, and therefore enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.

2510.01157 2026-06-11 cs.CL cs.CR cs.SD Version update

Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models

Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal, Peter West

Affiliations: University of British Columbia; École de technologie supérieure; Johns Hopkins University

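AI summary: Through the lens of backdoor attacks, performs a component-level analysis of speech language models, revealing how backdoors propagate across components; finds that backdoor persistence depends heavily on the targeted component and that poisoned and benign samples are not directly separable in shared embeddings.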

Comments Interspeech 2026 (long paper)

Abstract

Speech language models (SLMs) are systems of systems: independent components that unite to achieve a common goal. Despite their heterogeneous nature, SLMs are often studied end-to-end; how information flows through the pipeline remains obscure. We investigate this question through the lens of backdoor attacks. We first establish that backdoors can propagate through the SLM, leaving all tasks highly vulnerable. From this, we design a component analysis to discover the role each component takes in backdoor learning. We find that backdoor persistence or erasure is highly dependent on the targeted component. Beyond propagation, we examine how backdoors are encoded in shared multitask embeddings, showing that poisoned samples are not directly separable from benign ones, challenging a common separability assumption used in filtering defenses. Our findings emphasize the need to treat multimodal pipelines as intricate systems with unique vulnerabilities, not solely extensions of unimodal ones.

2603.14867 2026-06-11 cs.LG cs.AI cs.GT cs.MA Version update

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

Affiliations: University of Tokyo; National Institute of Information and Communications Technology

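AI summary: For decentralized bi-level reinforcement learning, where the leader cannot intervene in the follower's optimization process, proposes a hypergradient estimation method based on the Boltzmann covariance trick that enables sample-efficient optimization in high-dimensional decision spaces and is the first such method applied to two-player Markov games.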

Comments 29 pages. Extended version of the paper accepted to ICAPS 2026

English abstract

Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.
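
The abstract does not spell out the Boltzmann covariance trick. One identity that such a trick plausibly builds on (an assumption here, not the authors' derivation) is that, for a Boltzmann (softmax) distribution, the gradient of an expectation with respect to the logits equals a covariance, so it can be estimated from samples alone. The sketch below checks that identity numerically; `theta`, `g`, and all function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_value(theta, g):
    """E_{a ~ softmax(theta)}[g(a)], with one logit per action."""
    p = softmax(theta)
    return sum(p_k * g_k for p_k, g_k in zip(p, g))

def covariance_gradient(theta, g):
    """d/d theta_k E[g] = Cov(g(a), 1[a = k]) = p_k * (g_k - E[g])."""
    p = softmax(theta)
    mean_g = sum(p_k * g_k for p_k, g_k in zip(p, g))
    return [p_k * (g_k - mean_g) for p_k, g_k in zip(p, g)]

def finite_difference_gradient(theta, g, eps=1e-6):
    """Central differences, used only as an independent check."""
    grads = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grads.append((expected_value(up, g) - expected_value(down, g)) / (2 * eps))
    return grads

theta = [0.2, -1.0, 0.5]   # illustrative logits
g = [1.0, 3.0, -2.0]       # illustrative per-action payoff
analytic = covariance_gradient(theta, g)
numeric = finite_difference_gradient(theta, g)
```

The covariance form needs only samples of `g` and of the chosen action, which is the kind of property that makes estimation from interaction data possible.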

2603.22934 2026-06-11 cs.AI updated version

ProGRank: Probe-Gradient Reranking to Defend Dense-Retriever RAG from Corpus Poisoning

ProGRank: 探针梯度重排序以防御密集检索器RAG免受语料投毒攻击

Xiangyu Yin, Yi Qi, Chih-Hong Cheng

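Affiliations: Chalmers University of Technology, Sweden; University of Leeds, United Kingdom; Carl von Ossietzky University of Oldenburg, Germany

AI summary: Proposes ProGRank, a training-free, post hoc retriever-side defense that extracts instability signals from probe gradients under randomized perturbations and uses them for reranking, defending dense-retriever RAG against corpus poisoning.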

Comments accepted by ECML PKDD 2026

AI-generated Chinese abstract

检索增强生成(RAG)通过将生成基于检索到的证据来改进大语言模型应用,但也引入了语料投毒这一新的攻击面。在此场景中,攻击者注入或编辑段落,使其进入目标查询的Top-K结果并影响下游生成。现有防御通常依赖内容过滤、辅助模型或生成器端推理,这使部署复杂化。我们提出ProGRank,一种针对密集检索器RAG的事后、无需训练的检索器端防御。ProGRank在轻度随机扰动下对每个查询-段落对进行压力测试,从固定小参数子集中提取探针梯度,并推导出两个不稳定信号:表示一致性和分散风险。然后,它将这些信号与分数门控结合进行重排序。ProGRank保留原始段落内容,无需重新训练,并在部署的检索器不可用时支持基于代理的变体。跨数据集、检索器、攻击以及检索阶段和端到端设置的实验表明,ProGRank提高了鲁棒性,并保持了良好的鲁棒性-效用权衡,包括在自适应规避攻击下。

English abstract

Retrieval-Augmented Generation (RAG) improves large language model applications by grounding generation in retrieved evidence, but also introduces corpus poisoning as a new attack surface. In this setting, an adversary injects or edits passages so that they enter the Top-$K$ results for target queries and influence downstream generation. Existing defences often rely on content filtering, auxiliary models, or generator-side reasoning, which complicates deployment. We propose ProGRank, a post hoc, training-free retriever-side defence for dense-retriever RAG. ProGRank stress-tests each query--passage pair under mild randomized perturbations, extracts probe gradients from a small fixed parameter subset, and derives two instability signals: representational consistency and dispersion risk. It then combines these signals with a score gate for reranking. ProGRank preserves the original passage content, requires no retraining, and supports a surrogate-based variant when the deployed retriever is unavailable. Experiments across datasets, retrievers, attacks, and retrieval-stage and end-to-end settings show that ProGRank improves robustness and maintains a favorable robustness--utility trade-off, including under adaptive evasive attacks.
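
A rough sketch of how instability signals might feed a gated rerank. The abstract names the two signals but does not define them, so the definitions here are assumptions: consistency is taken as the mean pairwise cosine similarity of probe gradients across perturbed runs, and dispersion as the coefficient of variation of their norms. The gate, the weights `alpha` and `beta`, and the toy gradients are likewise hypothetical.

```python
import math
from itertools import combinations

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def cosine(u, v):
    nu, nv = norm(u), norm(v)
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def consistency(grads):
    """Assumed signal: mean pairwise cosine similarity of probe gradients."""
    pairs = list(combinations(grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def dispersion(grads):
    """Assumed signal: coefficient of variation of gradient norms."""
    norms = [norm(g) for g in grads]
    mean = sum(norms) / len(norms)
    var = sum((n - mean) ** 2 for n in norms) / len(norms)
    return math.sqrt(var) / mean if mean else 0.0

def rerank(candidates, gate=0.5, alpha=1.0, beta=1.0):
    """Penalize unstable passages, but only those whose score passes the gate."""
    rescored = []
    for passage_id, score, grads in candidates:
        penalty = 0.0
        if score >= gate:
            penalty = alpha * (1.0 - consistency(grads)) + beta * dispersion(grads)
        rescored.append((passage_id, score - penalty))
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Toy input: (id, retriever score, probe gradients from three perturbed runs).
candidates = [
    ("clean", 0.80, [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]),
    ("poisoned", 0.90, [[1.0, 0.0], [-0.2, 1.5], [0.1, -3.0]]),
]
ranking = rerank(candidates)
```

In this toy case the higher-scoring passage drops below the stable one because its gradients disagree across perturbations; the passage text itself is never modified.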

2603.24080 2026-06-11 cs.CL cs.DB updated version

LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

LLMpedia:一个大规模实现LLM百科全书知识的透明框架

Muhammed Saeed, Simon Razniewski

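Affiliations: ScaDS.AI Dresden/Leipzig & TU Dresden, Germany

AI summary: Proposes the LLMpedia framework, which extracts about 1.3M encyclopedia articles from three model families and audits them against Wikipedia and web evidence. The verifiable true rate turns out to be far below MMLU benchmark scores, revealing a factuality gap in model knowledge.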

AI-generated Chinese abstract

像MMLU这样的基准测试表明,旗舰语言模型的事实性饱和度超过90%。LLMpedia显示这一图景并不完整。我们从三个模型家族的参数记忆中具体化出约130万篇百科全书文章,然后针对维基百科和精选网络证据审计每一条声明。对于gpt-5-mini,在维基百科覆盖的主题上,可验证真实率为68.4%,比MMLU低超过21个百分点,这一差距主要由不可验证性(30.5%)驱动,而非反驳(1.2%)。在维基百科之外,针对精选网络证据审计的前沿文章达到57.6%;维基百科仅覆盖模型呈现主题的56.7%,三个模型家族在主题选择上仅有7.3%的重叠。在受先前Grokipedia分析启发的检索陷阱基准测试中,LLMpedia在与维基百科的文本相似度约为一半的情况下事实性更高。每个提示、文章和判决都已发布。数据、代码、界面:https://llmpedia.net。

English abstract

Benchmarks like MMLU suggest flagship language models approach factuality saturation above 90%. LLMpedia shows this picture is incomplete. We materialize ~1.3M encyclopedia articles entirely from parametric memory across three model families, then audit every claim against Wikipedia and curated web evidence. For gpt-5-mini, the verifiable true rate is 68.4% on Wikipedia-covered subjects, more than 21 pp below MMLU, and the gap is driven by unverifiability (30.5%), not refutation (1.2%). Beyond Wikipedia, frontier articles audited against curated web evidence reach 57.6%; Wikipedia covers only 56.7% of model-surfaced subjects, and three model families overlap in just 7.3% of subject choices. In a retrieval-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia is more factual at roughly half the textual similarity to Wikipedia. Every prompt, article, and verdict is released. Data, code, interface: https://llmpedia.net.

2601.22725 2026-06-11 cs.CV cs.AI updated version

OpenVTON-Bench: A Large-Scale High-Resolution Benchmark for Controllable Virtual Try-On Evaluation

OpenVTON-Bench:用于可控虚拟试穿评估的大规模高分辨率基准

Jin Li, Tao Chen, Kai Wen, Siqi Yin, Shuai Jiang, Weijie Wang, Jingwen Luo, Chenhui Wu

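Affiliations: Renxing Intelligence, Hangzhou, China; Hangzhou Dianzi University, Hangzhou, China

AI summary: Presents OpenVTON-Bench, about 100K high-resolution image pairs built with DINOv3 clustering and Gemini captioning, together with a multi-modal evaluation protocol that measures try-on quality along five dimensions and agrees closely with human judgments.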

Comments Under review for the NeurIPS 2026 Datasets and Benchmarks Track

AI-generated Chinese abstract

近期扩散模型的进展显著提升了虚拟试穿(VTON)系统的视觉保真度,但可靠的评估仍是一个持续的瓶颈。传统指标难以量化细粒度的纹理细节和语义一致性,而现有数据集在规模和多样性上无法满足商业标准。我们提出了OpenVTON-Bench,一个大规模基准,包含约10万对高分辨率图像(最高$1536 \times 1536$)。该数据集使用基于DINOv3的层次聚类进行语义平衡采样,并借助Gemini驱动的密集描述,确保在20个细粒度服装类别上均匀分布。为支持可靠评估,我们提出了一种多模态协议,沿五个可解释维度衡量VTON质量:背景一致性、身份保真度、纹理保真度、形状合理性和整体真实感。该协议将基于VLM的语义推理与基于SAM3分割和形态学腐蚀的新型多尺度表示度量相结合,能够分离边界对齐误差与内部纹理伪影。实验结果表明,该协议与人类判断高度一致(Kendall's $\tau$为0.833,而SSIM为0.611),为VTON评估建立了稳健的基准。

English abstract

Recent advances in diffusion models have significantly elevated the visual fidelity of Virtual Try-On (VTON) systems, yet reliable evaluation remains a persistent bottleneck. Traditional metrics struggle to quantify fine-grained texture details and semantic consistency, while existing datasets fail to meet commercial standards in scale and diversity. We present OpenVTON-Bench, a large-scale benchmark comprising approximately 100K high-resolution image pairs (up to $1536 \times 1536$). The dataset is constructed using DINOv3-based hierarchical clustering for semantically balanced sampling and Gemini-powered dense captioning, ensuring a uniform distribution across 20 fine-grained garment categories. To support reliable evaluation, we propose a multi-modal protocol that measures VTON quality along five interpretable dimensions: background consistency, identity fidelity, texture fidelity, shape plausibility, and overall realism. The protocol integrates VLM-based semantic reasoning with a novel Multi-Scale Representation Metric based on SAM3 segmentation and morphological erosion, enabling the separation of boundary alignment errors from internal texture artifacts. Experimental results show strong agreement with human judgments (Kendall's $τ$ of 0.833 vs. 0.611 for SSIM), establishing a robust benchmark for VTON evaluation.
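
Agreement with human judgments is reported above as Kendall's τ. A minimal sketch of that statistic follows, in its tau-a form with no tie correction; the benchmark may use a different variant, and the two rankings are made up for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

human_rank = [1, 2, 3, 4, 5]    # illustrative human ordering of five outputs
metric_rank = [1, 3, 2, 4, 5]   # illustrative metric ordering, one swap
tau = kendall_tau(human_rank, metric_rank)
```

With one swapped pair out of ten, nine pairs are concordant and one is discordant, so `tau` is 0.8.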

2603.20190 2026-06-11 cs.CV updated version

CoVR-R: Reason-Aware Composed Video Retrieval

CoVR-R: 推理感知的组合视频检索

Omkar Thawakar, Dmitry Demidov, Vaishnav Potlapalli, Sai Prasanna Teja Reddy Bogireddy, Viswanatha Reddy Gajjala, Alaa Mostafa Lasheen, Rao Muhammad Anwer, Fahad Khan

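Affiliations: Mohamed bin Zayed University of AI; University of Chicago; University of Wisconsin-Madison; Linköping University

AI summary: Proposes a zero-shot, reasoning-first approach that uses large multimodal models to infer the causal and temporal after-effects of an edit, and builds the CoVR-Reason benchmark for evaluation. It clearly outperforms strong baselines on implicit-effect subsets.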

Comments 9 Pages, 3 Figures

AI-generated Chinese abstract

组合视频检索(CoVR)旨在根据参考视频和文本修改找到目标视频。先前的工作假设修改文本完全指定了视觉变化,忽略了编辑产生的后效和隐含后果(例如,运动、状态转换、视角或持续时间线索)。我们认为成功的CoVR需要对这些后效进行推理。我们提出了一种推理优先的零样本方法,利用大型多模态模型(i)推断编辑所隐含的因果和时序后果,以及(ii)将得到的推理查询与候选视频对齐,无需任务特定的微调。为了评估CoVR中的推理能力,我们还提出了CoVR-Reason基准,该基准将每个(参考、编辑、目标)三元组与结构化的内部推理轨迹和具有挑战性的干扰项配对,这些干扰项需要预测后效而不是关键词匹配。实验表明,我们的零样本方法在召回率@K上优于强检索基线,并且在隐式效应子集上尤其出色。我们的自动和人工分析证实了检索结果中更高的步骤一致性和效果真实性。我们的发现表明,将推理纳入通用多模态模型可以通过明确考虑因果和时序后效来实现有效的CoVR。这减少了对任务特定监督的依赖,提高了对具有挑战性的隐式效应案例的泛化能力,并增强了检索结果的可解释性。这些结果指向了一个可扩展且原则性的可解释视频搜索框架。模型、代码和基准可在 https://github.com/mbzuai-oryx/CoVR-R 获取。

English abstract

Composed Video Retrieval (CoVR) aims to find a target video given a reference video and a textual modification. Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects and implicit consequences (e.g., motion, state transitions, viewpoint or duration cues) that emerge from the edit. We argue that successful CoVR requires reasoning about these after-effects. We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching. Experiments show that our zero-shot method outperforms strong retrieval baselines on recall at K and particularly excels on implicit-effect subsets. Our automatic and human analysis confirm higher step consistency and effect factuality in our retrieved results. Our findings show that incorporating reasoning into general-purpose multimodal models enables effective CoVR by explicitly accounting for causal and temporal after-effects. This reduces dependence on task-specific supervision, improves generalization to challenging implicit-effect cases, and enhances interpretability of retrieval outcomes. These results point toward a scalable and principled framework for explainable video search. The model, code, and benchmark are available at https://github.com/mbzuai-oryx/CoVR-R.
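
Results are reported as recall at K. A minimal sketch of that metric for single-target retrieval; the video IDs and rankings are invented for illustration.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of queries whose target appears among the top-k retrieved items."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k]
    )
    return hits / len(targets)

# One ranked candidate list per query, best match first (illustrative).
ranked_lists = [
    ["v3", "v1", "v7"],
    ["v2", "v9", "v4"],
    ["v5", "v6", "v8"],
]
targets = ["v1", "v4", "v0"]   # the third target is never retrieved

r_at_1 = recall_at_k(ranked_lists, targets, 1)
r_at_2 = recall_at_k(ranked_lists, targets, 2)
r_at_3 = recall_at_k(ranked_lists, targets, 3)
```

Recall at K cannot decrease as K grows, which is why it is usually reported at several cutoffs.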

2603.00461 2026-06-11 cs.CV updated version

ReMoT: Reinforcement Learning with Motion Contrast Triplets

ReMoT: 基于运动对比三元组的强化学习

Cong Wan, Zeyu Guo, Jiangyang Li, SongLin Dong, Yifan Bai, Lin Peng, Zhiheng Ma, Yihong Gong

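Affiliations: Xi'an Jiaotong University; Shenzhen University of Advanced Technology; DAMO Academy, Alibaba Group

AI summary: Proposes ReMoT, a unified training paradigm that combines a rule-generated, large-scale motion-contrast dataset with Group Relative Policy Optimization to address the spatio-temporal consistency problem of VLMs, improving performance on spatio-temporal reasoning tasks by 25.1%.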

Comments CVPR 2026 Highlight

AI-generated Chinese abstract

我们提出ReMoT,一种统一的训练范式,系统地解决VLM在时空一致性方面的基本缺陷——这是导航、机器人和自动驾驶中的关键失败点。ReMoT整合了两个核心组件:(1) 一个基于规则的自动框架,生成ReMoT-16K,这是一个大规模(16.5K三元组)的运动对比数据集,源自视频元注释,超越了昂贵的手动或基于模型的生成。(2) 组相对策略优化,我们经验验证了它在学习这种对比推理时产生最优性能和数据效率,远远超过标准的监督微调。我们还构建了第一个细粒度运动对比三元组基准,用于衡量VLM对细微运动属性(例如相反方向)的辨别能力。由此产生的模型在我们的新基准和多个标准VLM基准上实现了最先进的性能,最终在时空推理任务上实现了惊人的25.1%的性能飞跃。

English abstract

We present ReMoT, a unified training paradigm to systematically address the fundamental shortcomings of VLMs in spatio-temporal consistency -- a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) A rule-based automatic framework that generates ReMoT-16K, a large-scale (16.5K triplets) motion-contrast dataset derived from video meta-annotations, surpassing costly manual or model-based generation. (2) Group Relative Policy Optimization, which we empirically validate yields optimal performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning. We also construct the first benchmark for fine-grained motion contrast triplets to measure a VLM's discrimination of subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and multiple standard VLM benchmarks, culminating in a remarkable 25.1% performance leap on spatio-temporal reasoning tasks.
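
The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, with no learned value function. A minimal sketch of that normalization step; the binary rewards stand in for a rule-based check on a motion-contrast triplet and are an assumption, not the paper's reward design.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt; 1.0 means the rule-based check passed.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses above the group mean get positive advantages and those below get negative ones; a group with identical rewards yields advantages of zero and therefore no learning signal.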

2408.02600 2026-06-11 cs.CL updated version

BioMamba: Domain-Adaptive Biomedical Language Models

BioMamba: 领域自适应的生物医学语言模型

Ling Yue, Mingzhi Zhu, Sixue Xing, Yunning Cao, Yanbo Wang, Shimin Shan, Jinfei Liu, Vijil Chenthamarakshan, Shaowu Pan, Payel Das, Tianfan Fu

Affiliations: Rensselaer Polytechnic Institute; Jiaxing New Jies Thermal Power Co. Ltd.; North University of China; Zhejiang University; IBM Research; Nanjing University

AI summary: Proposes BioMamba, a Mamba2-based domain-adaptive pretraining approach that continues training on a mixture of PubMed, C4, and Wikipedia data, markedly lowering biomedical perplexity while preserving general language ability.

AI-generated Chinese abstract

背景。生物医学语言模型应在提升生物医学文本性能的同时保持通用语言模型的流畅性。对于基于Mamba的模型,这种权衡在生物医学文献和临床文本中尚未得到系统研究。方法。我们开发了BioMamba,一个包含五个规模的生物医学Mamba2模型家族,通过在PubMed摘要、Colossal Clean Crawled Corpus (C4)和Wikipedia的80%/10%/10%平衡混合数据上对已发布的公开Mamba2检查点进行持续预训练得到。贡献在于自适应配方和附带的开放权重检查点。结果。在五个规模上,BioMamba一致降低了PubMed困惑度,将Wikipedia风格的保留困惑度提高了1.46-4.72 PPL,而C4困惑度基本不变。在六个域外多项选择基准上,BioMamba保持在Mamba2的+/-3个百分点内,没有系统性退化。经过监督微调后,BioMamba+SFT在每个评估规模上匹配或超过Mamba2+SFT在MIMIC-IV笔记补全和出院总结生成上的表现,并在每个规模上改进了PubMedQA。最强模型(BioMamba-2.7B)在PubMed上达到5.28的困惑度,在BioASQ和PubMedQA上分别达到90.24%和73.00%的准确率。结论。平衡的领域自适应持续预训练配方增强了Mamba2语言模型在生物医学文献和临床文本上的性能,同时保持了通用语言建模的流畅性。

English abstract

Background. Biomedical language models should improve performance on biomedical text while retaining general-language-modeling fluency. For Mamba-based models, this trade-off has not been systematically studied across biomedical literature and clinical text. Methods. We developed BioMamba, a family of biomedical Mamba2 models at five scales obtained by continued pretraining of released public Mamba2 checkpoints on a balanced 80%/10%/10% mixture of PubMed abstracts, the Colossal Clean Crawled Corpus (C4), and Wikipedia. The contribution is the adaptation recipe and the accompanying open-weight checkpoints. Results. Across five scales, BioMamba consistently lowered PubMed perplexity, improved Wikipedia-style held-out perplexity by 1.46-4.72 PPL, and left C4 perplexity essentially unchanged. On six out-of-domain multiple-choice benchmarks, BioMamba stayed within +/-3 percentage points of Mamba2 with no systematic regression. After supervised fine-tuning, BioMamba+SFT matched or exceeded Mamba2+SFT on MIMIC-IV note completion and discharge summary generation at every evaluated scale, and improved PubMedQA at every scale. The strongest model (BioMamba-2.7B) reached a PubMed perplexity of 5.28 and accuracies of 90.24% and 73.00% on BioASQ and PubMedQA, respectively. Conclusions. A balanced domain-adaptive continued pretraining recipe strengthens Mamba2 language models on biomedical literature and clinical text while preserving general-language-modeling fluency.
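
Two quantities carry most of this abstract: the 80%/10%/10% data mixture and perplexity. The sketch below shows how each is computed in the simplest case. The function names and the document-level split are assumptions; the paper may mix at the token or batch level instead.

```python
import math

def mixture_counts(total_docs, weights):
    """Split a document budget according to mixture weights (rounded)."""
    return {name: round(total_docs * w) for name, w in weights.items()}

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood, using natural logarithms."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

mix = mixture_counts(1000, {"pubmed": 0.8, "c4": 0.1, "wikipedia": 0.1})

# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```

Lower perplexity is better, so a reported perplexity change "by 1.46-4.72 PPL" described as an improvement means the value went down by that amount.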

2601.10724 2026-06-11 cs.RO updated version

Adaptive Sliding Mode Control for Vehicle Platoons with State-Dependent Friction Uncertainty

具有状态依赖摩擦不确定性的车辆队列自适应滑模控制

Rishabh Dev Yadav

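Affiliations: Robotics Research Center, International Institute of Information Technology Hyderabad

AI summary: For unknown, state-dependent friction forces in vehicle platoons, proposes an adaptive sliding mode controller that handles friction uncertainty without prior knowledge and achieves speed regulation and spacing maintenance.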

AI-generated Chinese abstract

多机器人编队控制在车辆编队、队列、载荷运输和监视等领域有广泛应用。在车辆队列中保持编队需要设计合适的控制方案,能够处理外部干扰和不确定的系统参数,同时保持机器人之间预定义的安全距离。此背景下的一个关键挑战是处理车轮与地面之间未知/不确定的摩擦力,这些摩擦力随路面变化、轮胎磨损和车辆速度而变化。尽管最先进的自适应控制器可以处理先验有界的不确定性,但它们难以准确建模和识别摩擦力,这些摩擦力通常是状态依赖的且无法先验有界。本文提出了一种新的基于轮式移动机器人的车辆队列自适应滑模控制器,无需先验了解摩擦力的参数和结构即可处理其未知和复杂的行为。该控制器利用自适应滑模控制技术来调节队列速度并保持预定义的机器人间距离,即使在存在外部干扰和不确定系统参数的情况下也是如此。该方法包括两个阶段:首先,运动学控制器根据期望轨迹计算期望速度;其次,动力学模型生成命令以实现期望运动。通过分离机器人的运动学和动力学,该方法可以简化控制问题,并实现对轮式移动机器人更高效、更鲁棒的控制。

English abstract

Multi-robot formation control has various applications in domains such as vehicle troops, platoons, payload transportation, and surveillance. Maintaining formation in a vehicle platoon requires designing a suitable control scheme that can tackle external disturbances and uncertain system parameters while maintaining a predefined safe distance between the robots. A crucial challenge in this context is dealing with the unknown/uncertain friction forces between wheels and the ground, which vary with changes in road surface, wear in tires, and speed of the vehicle. Although state-of-the-art adaptive controllers can handle a priori bounded uncertainties, they struggle with accurately modeling and identifying frictional forces, which are often state-dependent and cannot be a priori bounded. This thesis proposes a new adaptive sliding mode controller for wheeled mobile robot-based vehicle platoons that can handle the unknown and complex behavior of frictional forces without prior knowledge of their parameters and structures. The controller uses the adaptive sliding mode control techniques to regulate the platoon's speed and maintain a predefined inter-robot distance, even in the presence of external disturbances and uncertain system parameters. This approach involves a two-stage process: first, the kinematic controller calculates the desired velocities based on the desired trajectory; and second, the dynamics model generates the commands to achieve the desired motion. By separating the kinematics and dynamics of the robot, this approach can simplify the control problem and allow for more efficient and robust control of the wheeled mobile robot.
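
To make the idea of adapting a switching gain without a friction model concrete, here is a one-dimensional toy: a single vehicle tracking a constant speed. It is not the paper's controller. The friction law, the boundary-layer saturation, the gain update, and every parameter value are assumptions chosen for illustration.

```python
def friction(v):
    """Hypothetical state-dependent friction; the controller never reads this."""
    return 0.5 * v + 0.3 * v * abs(v)

def saturate(x):
    return max(-1.0, min(1.0, x))

def simulate(v_desired=1.0, mass=1.0, gamma=5.0, phi=0.05, dt=0.001, steps=20000):
    """Euler simulation of mass * dv/dt = u - friction(v)."""
    v, gain = 0.0, 0.0
    for _ in range(steps):
        s = v - v_desired                    # sliding variable (speed error)
        u = -gain * saturate(s / phi)        # smoothed switching control
        if abs(s) > phi:                     # adapt only outside the boundary layer
            gain += gamma * abs(s) * dt
        v += dt * (u - friction(v)) / mass
    return v, gain

final_speed, final_gain = simulate()
```

The gain grows until it dominates the unknown friction, then freezes once the error enters the boundary layer of width `phi`; the remaining error stays inside that layer, which is the usual trade of exact convergence for reduced chattering.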

2511.05203 2026-06-11 cs.RO updated version

SIL: Symbiotic Interactive Learning for Language-Conditioned Human-Agent Co-Adaptation

SIL: 语言条件的人机协同适应的共生交互学习

Linus Nwankwo, Bjoern Ellensohn, Christian Rauch, Elmar Rueckert

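Affiliations: Technical University of Leoben

AI summary: Proposes the symbiotic interactive learning (SIL) framework for bidirectional co-adaptation between human and agent in a shared latent task space. With joint belief states, FM-based spatial reasoning, and a memory architecture, it reaches a 90.4% completion rate and a belief alignment score of 0.83 on tasks such as instruction following.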

AI-generated Chinese abstract

当今的自主智能体主要由基础模型(FMs)驱动,能够理解自然语言指令并以类似人类的推理解决长期任务。然而,当前的人机交互框架大多遵循单向的主从技术,其中具身智能体被动执行命令而没有互惠学习。这忽视了日常人际交互中协同适应、多轮交互的本质。我们引入了共生交互学习(SIL),一个在共享潜在任务空间中的双向协同适应框架,其中人类和智能体都维护着随交互历史演变的联合信念状态。这使得主动澄清、适应性建议和共享计划细化成为可能。SIL利用FMs进行空间感知和推理,并结合一个三元组损失训练的神经编码器,将FMs的输出嵌入到任务特定的潜在表示中。为了支持任务演变时的长期稳定性,SIL利用情景记忆和语义记忆架构,并通过弹性权重巩固进行正则化以减轻灾难性遗忘。我们在模拟和真实世界的具身任务上评估SIL,包括指令跟随、信息检索、查询导向推理和交互式对话,实现了90.4%的任务完成率和ρ≈0.83的信念对齐分数,比最佳消融实验绝对提高了约20个百分点。演示和资源:https://linusnep.github.io/SIL/。

English abstract

Today's autonomous agents, largely driven by foundation models (FMs), can understand natural language instructions and solve long-horizon tasks with human-like reasoning. However, current human-robot interaction frameworks largely follow a one-way master-apprentice technique where the embodied agent passively executes commands without reciprocal learning. This neglects the co-adaptive, multi-turn nature of everyday human-to-human interactions. We introduce symbiotic interactive learning (SIL), a bidirectional co-adaptation framework in a shared latent task space, where both the human and the agent maintain joint belief states that evolve with the interaction history. This enables proactive clarification, adaptive suggestions, and shared plan refinement. SIL leverages FMs for spatial perception and reasoning, together with a triplet-loss-trained neural encoder that grounds the FMs' outputs into task-specific latent representations. To support long-term stability as tasks evolve, SIL utilises episodic and semantic memory architectures, regularised via elastic weight consolidation to mitigate catastrophic forgetting. We evaluate SIL on simulated and real-world embodied tasks, including instruction following, information retrieval, query-oriented reasoning, and interactive dialogue, achieving a $90.4\%$ task completion rate and a belief alignment score of $ρ\approx 0.83$, an absolute improvement of about $20$ percentage points over the best ablations. Demos and resources: https://linusnep.github.io/SIL/.
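
Two standard ingredients named above, the triplet loss and the elastic weight consolidation (EWC) penalty, have simple closed forms. The sketch below uses their textbook definitions on toy vectors; the margin, the strength, and the parameter values are illustrative and not taken from the paper.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def ewc_penalty(params, old_params, fisher, strength=1.0):
    """Quadratic pull toward old parameters, weighted by Fisher importance."""
    return 0.5 * strength * sum(
        f * (p - o) ** 2 for p, o, f in zip(params, old_params, fisher)
    )

satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # d_pos=1, d_neg=5
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])    # d_pos=5, d_neg=1
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], fisher=[4.0, 10.0])
```

The EWC term only charges for movement in parameters that mattered to earlier tasks: here the second parameter has high Fisher weight but did not move, so it contributes nothing.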

2511.16672 2026-06-11 cs.CV updated version

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

EvoLMM:具有连续奖励的自进化大型多模态模型

Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

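Affiliations: Mohamed bin Zayed University of AI; Aalto University; Australian National University; Linköping University

AI summary: Proposes the EvoLMM framework, which instantiates two cooperative agents, a Proposer and a Solver, from a single backbone model and uses a continuous self-rewarding process to improve LMM reasoning without supervision, gaining about 3% on benchmarks such as ChartQA.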

Comments 9 pages, 6 figures

AI-generated Chinese abstract

近年来,大型多模态模型(LMMs)的进展实现了令人印象深刻的推理和感知能力,但大多数现有训练流程仍依赖于人工策划的数据或外部验证的奖励模型,限制了其自主性和可扩展性。在这项工作中,我们致力于以纯无监督方式(无需任何标注数据或奖励蒸馏)提升LMM的推理能力。为此,我们提出了一个名为EvoLMM的自进化框架,该框架从单个骨干模型实例化两个协作智能体:提议者(Proposer),生成多样化的、基于图像的问题;以及求解者(Solver),通过内部一致性解决这些问题,学习过程通过连续的自奖励机制进行。这种动态反馈促进了信息性查询的生成和结构化推理的改进,而无需依赖真实标签或人工判断。当使用流行的Qwen2.5-VL作为基础模型时,我们的EvoLMM在多模态数学推理基准(包括ChartQA、MathVista和MathVision)上取得了约3%的持续提升,仅使用原始训练图像。我们希望这种简单而有效的方法能成为一个坚实的基线,促进未来在完全无监督方式下自我改进LMM的研究。我们的代码和模型可在 https://github.com/mbzuai-oryx/EvoLMM 获取。

English abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains of up to ~3% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
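
One way to read "continuous self-rewarding through internal consistency" is that the Solver's sampled answers are scored by how much they agree with each other, on a graded scale rather than by a binary majority vote. The sketch below follows that reading. Both reward formulas are assumptions for illustration and are not taken from the paper.

```python
from collections import Counter

def solver_rewards(answers):
    """Reward each sampled answer by the share of samples that agree with it."""
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]

def proposer_reward(answers):
    """Hypothetical: favor questions of moderate difficulty.

    Peaks when the most common answer takes half of the samples and falls
    to zero when every sample agrees (the question was trivial).
    """
    top_share = Counter(answers).most_common(1)[0][1] / len(answers)
    return 1.0 - abs(2.0 * top_share - 1.0)

samples = ["12", "12", "15", "12"]       # four Solver samples for one question
s_rewards = solver_rewards(samples)
p_reward = proposer_reward(samples)
```

Because the rewards are fractions rather than 0/1 labels, a partly consistent set of answers still produces a usable gradient signal.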

2603.12261 2026-06-11 cs.LG cs.AI cs.CV updated version

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间:高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

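Affiliations: University of California, Berkeley; University of Toronto; University of Cambridge; University of Oxford

AI summary: Reveals an HSL structure in how color is represented in the latent space of the FLUX.1 variational autoencoder, and proposes a training-free, closed-form latent-space manipulation method that predicts and explicitly controls the color of generated images.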

Comments Accepted at ICML 2026

AI-generated Chinese abstract

文本到图像生成模型已取得快速进展,但实现对生成图像的细粒度控制仍然困难,这主要源于对语义信息编码方式的理解有限。我们开发了对FLUX.1 [Dev]变分自编码器潜在空间中颜色表示的解释,揭示了一种反映色相、饱和度和明度的结构。我们通过证明潜在颜色子空间(LCS)解释能够预测并显式控制颜色,验证了其有效性,并引入了一种完全无需训练的FLUX方法,该方法仅基于闭式潜在空间操作。代码可在 https://github.com/ExplainableML/LCS 获取。

English abstract

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at https://github.com/ExplainableML/LCS.

2603.09715 2026-06-11 cs.AI updated version

Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

问题真的重要吗?视觉-语言SFT的无训练数据选择

Peng Sun, Yi Yang, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

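Affiliations: Nanjing University; Institute of Information Engineering; North University of China; Shanghai Artificial Intelligence Laboratory

AI summary: Proposes CVS, which uses a frozen vision-language large model to assess how much the question affects answer validity, selecting high-quality samples that require cross-modal reasoning without any training. On multiple datasets it beats full-data training with a small fraction of the data.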

AI-generated Chinese abstract

视觉指令微调对于提升视觉-语言大模型(VLLMs)至关重要。然而,许多样本可以通过语言模式或常识捷径解决,无需真正的跨模态推理,限制了多模态学习的有效性。先前的数据选择方法通常依赖于代价高昂的代理模型训练,并侧重于难度或多样性,未能捕捉样本对视觉-语言联合推理的真实贡献。在本文中,我们提出CVS,一种基于以下洞见的无训练数据选择方法:对于高质量的多模态样本,引入问题应显著改变模型在给定图像下对答案有效性的评估。CVS利用冻结的VLLM作为评估器,测量在有/无问题条件下答案有效性的差异,从而识别需要视觉-语言联合推理的样本,同时过滤语义冲突噪声。在Vision-Flan和The Cauldron上的实验表明,CVS在数据集上取得了稳定的性能。在Vision-Flan上,CVS仅使用10%和15%的数据就分别比全量训练高出3.5%和4.8%,并且在高度异构的Cauldron数据集上保持鲁棒。此外,与COINCIDE和XMAS相比,CVS分别降低了17.3%和44.4%的计算成本。

English abstract

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require vision-language joint reasoning while filtering semantic-conflict noise. Experiments on Vision-Flan and The Cauldron show that CVS achieves solid performance across datasets. On Vision-Flan, CVS outperforms full-data training by 3.5% and 4.8% using only 10% and 15% of the data, respectively, and remains robust on the highly heterogeneous Cauldron dataset. Moreover, CVS reduces computational cost by 17.3% and 44.4% compared to COINCIDE and XMAS.
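
The selection rule described above can be sketched as a score-and-keep-top-fraction step. The sketch assumes the frozen evaluator has already produced two validity probabilities per sample, one with and one without the question; how those are obtained, the signed difference used as the score, and all names are assumptions rather than the paper's exact formulation.

```python
def cvs_scores(samples):
    """Score = validity with the question minus validity without it.

    A large positive value suggests the question is needed to make the answer
    valid; a negative value may indicate a question that conflicts with it.
    """
    return {sid: p_with - p_without for sid, p_with, p_without in samples}

def select_top_fraction(scores, fraction):
    """Keep the highest-scoring fraction of samples (at least one)."""
    keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

# (sample id, P(valid | image, question, answer), P(valid | image, answer))
samples = [
    ("a", 0.90, 0.20),   # question matters a lot
    ("b", 0.80, 0.75),   # answer is judged valid even without the question
    ("c", 0.30, 0.60),   # question lowers validity
    ("d", 0.95, 0.50),
]
selected = select_top_fraction(cvs_scores(samples), fraction=0.5)
```

No model is trained at any point: the cost is two evaluator passes per sample, which is where the reported savings over proxy-training methods would come from.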

2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF updated version

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

编译器优先的状态空间对偶性与可移植的 $O(1)$ 自回归缓存推理

Cosmo Santoni, Anmol Thapar

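Affiliations: Imperial College London

AI summary: Proposes a compiler-first inference approach built on the state space duality (SSD) structure. Standard JAX primitives give a single-source inference path with no custom kernels that reaches high hardware utilization on TPU and GPU, with cached decoding 27-36x faster than full-prefix recomputation.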

Comments 21 pages, 6 figures. Code available at: https://github.com/CosmoNaught/mamba2-jax

AI-generated Chinese abstract

高吞吐量的Mamba-2推理通常依赖于融合的CUDA和Triton内核,这限制了在不同加速器后端之间的可移植性。我们证明状态空间对偶性(SSD)递归具有编译器友好的结构:对角逐头动态、固定大小分块、以einsum为主的计算以及静态控制流。在标准JAX原语中表达这种结构,可以得到一个无需自定义内核的单源推理路径、一个注册的JAX PyTree缓存以及一个编译后的设备上自回归循环。在单个Google Cloud TPU v6e上,batch-1预填充达到约140 TFLOPS,即15%的模型FLOP利用率(MFU),这是该场景下的屋顶线上限;缓存解码达到高达64%的硬件带宽利用率(HBU)。在4096个token的上下文中,对于五个Mamba-2检查点(参数从130M到2.7B),缓存解码比全前缀重计算快27-36倍。相同的源代码在未修改的情况下可在NVIDIA L40S上运行,其中缓存解码在所有模型规模下均保持序列长度无关。WikiText-103验证困惑度与Triton参考实现mamba_ssm v2.2.2相差在±0.0005以内,隐藏状态在float32舍入容差内一致。代码可在以下网址获取:https://github.com/CosmoNaught/mamba2-jax。

English abstract

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x--36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within +/-0.0005 points, and hidden states agree to float32 rounding tolerance. Code is available at https://github.com/CosmoNaught/mamba2-jax.
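
The reason cached decoding is independent of sequence length is that a state space recurrence needs only a fixed-size state from the past. The sketch below shows this with a scalar-decay recurrence, h_t = a_t * h_{t-1} + B_t * x_t and y_t = C_t · h_t, in plain Python rather than JAX. It is a simplified single-channel illustration under those assumptions, not the repository's implementation, and the numbers are arbitrary.

```python
def step(h, a_t, B_t, C_t, x_t):
    """One recurrence step: constant work regardless of how long the prefix is."""
    h_new = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
    y_t = sum(c_i * h_i for c_i, h_i in zip(C_t, h_new))
    return h_new, y_t

def recompute_last_output(a, B, C, x):
    """Full-prefix recomputation: O(T) work to get a single new output."""
    h = [0.0] * len(B[0])
    y = 0.0
    for t in range(len(x)):
        h, y = step(h, a[t], B[t], C[t], x[t])
    return y

def cached_decode(a, B, C, x):
    """Carry the state forward as a cache: O(1) work per new token."""
    h = [0.0] * len(B[0])
    outputs = []
    for t in range(len(x)):
        h, y = step(h, a[t], B[t], C[t], x[t])
        outputs.append(y)
    return outputs

a = [0.9, 0.8, 0.7]                           # per-step scalar decay
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]      # input projections
C = [[1.0, 1.0], [1.0, 0.0], [0.0, 2.0]]      # output projections
x = [1.0, 2.0, 3.0]
outputs = cached_decode(a, B, C, x)
```

Both paths produce identical outputs; the difference is that the cached path stores only `h`, whose size does not grow with the context.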

2603.08501 2026-06-11 cs.CL updated version

Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Fanar-Sadiq:一种面向有依据的伊斯兰问答的多智能体架构

Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam

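Affiliations: Qatar Computing Research Institute; HBKU

AI summary: To address hallucination and misattribution by LLMs in Islamic QA, presents Fanar-Sadiq, built on a multi-agent, tool-augmented architecture. With intent-aware routing, retrieval-grounded fiqh answers, exact scripture citation, and deterministic calculators, it delivers efficient and accurate Islamic QA on public benchmarks.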

Comments Islamic QA; Religious NLP; Retrieval-Augmented Generation; Multi-Agent LLMs; Tool-Augmented Reasoning; Faithful Generation; Fiqh Reasoning

AI-generated Chinese abstract

大型语言模型(LLM)能够流畅回答宗教知识查询,但经常产生幻觉并错误归因来源,这在伊斯兰环境中尤其严重,因为用户期望基于经典文本(《古兰经》和圣训)和教法(fiqh)细微差别的回答。检索增强生成(RAG)改善了基础性,但单一的检索-生成流程不足以处理多样化的伊斯兰查询,包括逐字经文、基于引用的指导以及规则约束的计算(如天课和遗产)。为了解决这些挑战,我们提出了Fanar-Sadiq,一个基于多智能体、工具增强架构的双语(阿拉伯语-英语)伊斯兰问答系统。它是Fanar AI平台的核心组件。Fanar-Sadiq将伊斯兰查询路由到智能体工具架构中的专门模块。它支持意图感知路由、带有标准化引用和验证轨迹的检索增强教法回答、带有引文验证的精确经文查找,以及具有教法学派敏感分支的确定性逊尼派天课和遗产计算器。我们在公开的伊斯兰问答基准上评估了端到端系统,显示出强大的有效性和效率。该系统通过API和Web应用程序公开访问,在不到一年的时间内已收到超过190万次访问(https://api.fanar.qa/docs)。

English abstract

Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) improves grounding; however, a single retrieve-then-generate pipeline is insufficient for diverse Islamic queries, including verbatim scripture, citation-grounded guidance, and rule-constrained computations such as zakat and inheritance. To address these challenges, we present Fanar-Sadiq, a bilingual Arabic-English Islamic QA system built on a multi-agent, tool-augmented architecture. It is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic queries to specialized modules within an agentic tool architecture. It supports intent-aware routing, retrieval-grounded fiqh answers with normalized citations and verification traces, exact verse lookup with quotation validation, and deterministic Sunni zakat and inheritance calculators with madhhab-sensitive branching. We evaluate the end-to-end system on public Islamic QA benchmarks and show strong effectiveness and efficiency. It is publicly accessible through an API and Web application and has received over 1.9M accesses in less than a year (https://api.fanar.qa/docs).
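
The routing idea can be sketched as a dispatcher that sends each query to one specialized module and keeps arithmetic out of the language model entirely. Everything below is hypothetical: the keyword rules stand in for whatever intent classifier the system uses, the module names are invented, and the calculator takes its threshold and rate from the caller rather than encoding any ruling.

```python
def route(query):
    """Pick a module for a query. Keyword matching is a stand-in for a real
    intent classifier; the module names are invented for this sketch."""
    q = query.lower()
    if "zakat" in q:
        return "zakat_calculator"
    if "inherit" in q:
        return "inheritance_calculator"
    if "verse" in q or "ayah" in q:
        return "verse_lookup"
    return "fiqh_rag"

def zakat_due(wealth, nisab, rate):
    """Deterministic rule: nothing is due below the threshold, else wealth * rate.

    `nisab` and `rate` are supplied by the caller; this sketch does not
    assert what their correct values are.
    """
    return wealth * rate if wealth >= nisab else 0.0

module = route("How much zakat do I owe on my savings?")
amount = zakat_due(wealth=10000.0, nisab=5000.0, rate=0.025)
below_threshold = zakat_due(wealth=1000.0, nisab=5000.0, rate=0.025)
```

Keeping the calculation in ordinary code means the same inputs always yield the same output, which is the property the abstract calls deterministic.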

2603.06910 2026-06-11 cs.CL updated version

Language Shapes Mental Health Evaluations in Large Language Models

语言塑造大型语言模型中的心理健康评估

Jiayi Xu, Xiyang Hu

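Affiliations: University of North Carolina at Chapel Hill; Arizona State University

AI summary: Studies whether multilingual LLMs show systematic, language-dependent shifts in mental health evaluation, and finds that Chinese prompts lead to higher stigma-related scores and more conservative depression severity judgments than English prompts.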

AI-generated Chinese abstract

多语言大型语言模型(LLMs)越来越多地用于社会敏感的心理健康场景,包括支持聊天机器人、筛查和内容审核。这引发了一个可靠性问题:语义上等效的心理健康输入是否在不同语言中引发可比较的评估,还是会出现与语言相关的社会和文化背景一致的系统性偏移?我们在英中双语环境中使用GPT-4o和Qwen3-32B,通过一个两层框架来检验这个问题:结构层面的评估取向(通过心理测量污名工具测量)和决策层面的行为(通过二元污名检测和四类抑郁严重度分类测量)。在多种工具和模型中,中文提示比英文提示引发更高的污名相关分数。在决策层面,中文提示降低了对污名化内容的敏感性,并产生更保守的抑郁严重度判断,导致更多的低估错误。这些发现表明,提示语言可以改变基于LLM的心理健康评估中的评估取向和下游行为。它们强调了评估多语言LLM时不仅需要关注整体性能,还需要关注它们是否在社会敏感领域中对不同语言应用了可比较的评估标准。

English abstract

Multilingual large language models (LLMs) are increasingly used in socially sensitive mental health contexts, including support chatbots, screening, and content moderation. This raises a reliability question: do semantically equivalent mental health inputs elicit comparable evaluations across languages, or systematic shifts consistent with language-associated social and cultural contexts? We examine this question in an English-Chinese setting with GPT-4o and Qwen3-32B using a two-level framework: construct-level evaluative orientation, measured by psychometric stigma instruments, and decision-level behavior, measured by binary stigma detection and four-class depression severity classification. Across instruments and models, Chinese prompts elicit higher stigma-related scores than English prompts. At the decision level, Chinese prompts reduce sensitivity to stigmatizing content and produce more conservative depression severity judgments, leading to more under-estimation errors. These findings show that prompt language can shift both evaluative orientation and downstream behavior in LLM-based mental health evaluation. They highlight the need to evaluate multilingual LLMs not only for aggregate performance, but also for whether they apply comparable evaluative standards across languages in socially sensitive domains.
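
For an ordinal task such as severity classification, plain accuracy hides the direction of the mistakes. A minimal sketch of how under-estimation can be separated from over-estimation; the four label names and both prediction lists are invented toy inputs, not data or results from the study.

```python
SEVERITY = ["minimal", "mild", "moderate", "severe"]   # hypothetical label set

def error_breakdown(gold, predicted):
    """Share of predictions below, above, and equal to the gold severity."""
    rank = {label: i for i, label in enumerate(SEVERITY)}
    n = len(gold)
    under = sum(1 for g, p in zip(gold, predicted) if rank[p] < rank[g])
    over = sum(1 for g, p in zip(gold, predicted) if rank[p] > rank[g])
    return {"under": under / n, "over": over / n, "correct": (n - under - over) / n}

gold = ["mild", "moderate", "severe", "severe"]
pred_a = ["mild", "moderate", "severe", "moderate"]    # toy predictions, setting A
pred_b = ["minimal", "mild", "severe", "moderate"]     # toy predictions, setting B

breakdown_a = error_breakdown(gold, pred_a)
breakdown_b = error_breakdown(gold, pred_b)
```

Comparing the `under` rate between two prompt languages on semantically equivalent inputs is the kind of check that exposes a shift toward more conservative judgments.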

2603.05573 2026-06-11 cs.LG version update

Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

为什么深度在可并行化序列模型中重要:一个李代数视角

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo L. Sabatini

Affiliations: Howard Hughes Medical Institute, Department of Neurobiology, Harvard Medical School; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Department of Psychology and Center for Brain Science, Harvard University

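AI summary: Uses a Lie-algebraic control perspective to study how the expressivity of parallelizable sequence models (such as Transformer variants and state-space models) relates to depth, and shows that error decreases exponentially as depth increases.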

Comments v2: Format update; split former Theorem 3.4 into Theorem 3.4 and Corollary 3.5 for clarity; corrected an indexing error affecting Corollary 3.6, Proposition 3.7, and Figure 2

AI中文摘要

可扩展的序列模型,如Transformer变体和结构化状态空间模型,通常以表达能力换取序列级并行性,从而实现高效训练。本文从李代数控制视角,考察模型在其表达能力范围之外运行时误差的边界及其缩放规律。我们的理论建立了序列模型深度与李代数扩展塔之间的对应关系。与近期理论研究相呼应,我们刻画了常数深度序列模型的李代数类别及其相应的表达能力边界。此外,我们解析推导了近似误差边界,并证明误差随深度增加呈指数下降,这与这些模型的强大实证表现一致。我们通过在符号词和连续值状态追踪问题上的实验验证了理论预测。

English abstract

Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we examine the bounds on error and how error scales when models operate outside of their expressivity regimes using a Lie-algebraic control perspective. Our theory formulates a correspondence between the depth of a sequence model and the tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.

2504.09762 2026-06-11 cs.AI version update

Position: Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!

立场:停止将中间令牌拟人化为推理/思考痕迹!

Subbarao Kambhampati, Karthik Valmeekam, Siddhant Bhambri, Vardhan Palod, Lucas Saldyt, Kaya Stechly, Soumya Rani Samineni, Durgesh Kalwar, Upasana Biswas

Affiliations: University of California, Berkeley

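AI summary: Argues that anthropomorphizing model-generated intermediate tokens as "reasoning traces" or "thinking traces" is misleading, and calls on the community to avoid it.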

Comments Appears in ICML 2026. [This is a fork of v1. This fork, while overlapping with v1 in background section, differs both in the overall focus as well as the specific argument against anthropomorphization of reasoning traces]

Journal ref
ICML 2026
AI中文摘要

中间令牌生成(ITG)是一种模型在输出解决方案之前产生输出的方法,已成为提高语言模型在推理任务上性能的标准方法。这些中间令牌被称为“推理痕迹”甚至“思考痕迹”——隐含地将这些痕迹拟人化,暗示它们类似于人类在解决难题时可能采取的步骤,因此可以为最终用户提供模型思考过程的可解释窗口。在这篇立场论文中,我们提出证据表明这种拟人化并非无害的隐喻,而是相当危险——它混淆了这些模型的本质以及如何有效使用它们,并导致可疑的研究。我们呼吁社区避免对中间令牌进行此类拟人化。

English abstract

Intermediate token generation (ITG), where a model produces output before the solution, has become a standard method to improve the performance of language models on reasoning tasks. These intermediate tokens have been called "reasoning traces" or even "thinking traces" -- implicitly anthropomorphizing the traces, and implying that these traces resemble steps a human might take when solving a challenging problem, and as such can provide an interpretable window into the operation of the model's thinking process to the end user. In this position paper, we present evidence that this anthropomorphization isn't a harmless metaphor, and instead is quite dangerous -- it confuses the nature of these models and how to use them effectively, and leads to questionable research. We call on the community to avoid such anthropomorphization of intermediate tokens.