arXivDaily: arXiv daily academic digest, updated Monday to Friday
2601.03624 2026-05-26 cs.AI

Architecting Agentic Communities using Design Patterns

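Zoran Milosevic, Fethi Rabhi

Affiliations: School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; Deontik, Australia

AI summary: The paper proposes a three-tier classification of design patterns drawn from enterprise distributed systems (LLM Agents, Agentic AI, and Agentic Communities) and validates its formal framework with a clinical trial matching case study, offering practical guidance and formal verification capabilities for engineering multi-agent ecosystems.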

Comments Supplementary material accompanying this paper is also attached; its title is "Complete Agentic AI Design Patterns Catalogue". Fixed encoding artefacts (garbled em dashes) throughout.

Abstract

The rapid evolution of Large Language Models (LLMs) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities, coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans, which are most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

2601.02144 2026-05-26 cs.CL cs.AI

Routing by Analogy: kNN-Augmented Expert Assignment for Mixture-of-Experts

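Boxuan Lyu, Soichiro Murakami, Hidetaka Kamigaito, Peinan Zhang

Affiliations: Institute of Science Tokyo; CyberAgent; Nara Institute of Science and Technology

AI summary: Proposes kNN-MoE, a framework that augments MoE routing by retrieving locally optimal expert assignments from similar past cases and uses the average similarity of the retrieved neighbors as a confidence-based mixing coefficient, improving robustness under distribution shift.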

Abstract

Mixture-of-Experts (MoE) architectures scale large language models efficiently by employing a parametric "router" to dispatch tokens to a sparse subset of experts. Typically, this router is trained once and then frozen, rendering routing decisions brittle under distribution shifts. We address this limitation by introducing kNN-MoE, a retrieval-augmented routing framework that reuses locally optimal expert assignments from a memory of similar past cases. This memory is constructed offline by directly optimizing token-wise routing logits to maximize the likelihood on a reference set. Crucially, we use the average similarity of retrieved neighbors as a confidence-driven mixing coefficient, thus allowing the method to fall back to the frozen router when no relevant cases are found. Experiments show that kNN-MoE outperforms the zero-shot baseline and is competitive with computationally intensive supervised fine-tuning.
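The confidence-driven mixing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name, the use of cosine similarity, the clipping of the coefficient to [0, 1], and the averaging of retrieved logits are all assumptions made for illustration.

```python
import numpy as np

def knn_mixed_router_logits(query, router_logits, memory_keys, memory_logits, k=4):
    """Blend frozen-router logits with logits retrieved from a memory of past cases.

    All names and design details here are hypothetical, not taken from the paper.
    """
    # Cosine similarity between the query and every stored key.
    q = query / np.linalg.norm(query)
    keys = memory_keys / np.linalg.norm(memory_keys, axis=1, keepdims=True)
    sims = keys @ q
    # Indices of the k most similar stored cases.
    top = np.argsort(-sims)[:k]
    # Mixing coefficient: average similarity of the retrieved neighbors,
    # clipped to [0, 1] (the clipping is an assumption).
    lam = float(np.clip(sims[top].mean(), 0.0, 1.0))
    # Average the stored routing logits of the neighbors.
    retrieved = memory_logits[top].mean(axis=0)
    # Low similarity means lam is near 0, so the frozen router dominates.
    mixed = lam * retrieved + (1.0 - lam) * router_logits
    return mixed, lam
```

When no stored key resembles the query, the coefficient approaches zero and the function returns the frozen router's logits essentially unchanged, which mirrors the fallback behavior the abstract describes.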

2601.00553 2026-05-26 cs.CV cs.AI

A Comprehensive Dataset for Human vs. AI Generated Image Detection

Rajarshi Roy, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Gurpreet Singh, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Amitava Das, Amit Sheth, Gaytri Jena, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

Affiliations: 1 Kalyani Government Engineering College, India. 2 IIIT Delhi, India. 3 BITS Pilani Hyderabad Campus, India. 4 University of South Carolina, USA. 5 IIIT Guwahati, India. 6 NIT Silchar, India. 7 San José State University, USA. 8 UCLA, USA. 9 Washington State University, USA. 10 Vishwakarma Institute of Information Technology, India. 11 Gandhi Institute for Technological Advancement, India. 12 Meta AI, USA. 13 Amazon AI, USA. 14 BITS Pilani Goa, India.

AI summary: To address AI-generated image detection, the authors build MS COCOAI, a dataset of 96,000 real and synthetic datapoints, and propose two tasks: classifying images as real or generated, and identifying the generating model.

Abstract

Multimodal generative AI systems like Stable Diffusion, DALL-E, and MidJourney have fundamentally changed how synthetic images are created. These tools drive innovation but also enable the spread of misleading content, false information, and manipulated media. As generated images become harder to distinguish from photographs, detecting them has become an urgent priority. To combat this challenge, we release MS COCOAI, a novel dataset for AI generated image detection consisting of 96000 real and synthetic datapoints, built using the MS COCO dataset. To generate synthetic images, we use five generators: Stable Diffusion 3, Stable Diffusion 2.1, SDXL, DALL-E 3, and MidJourney v6. Based on the dataset, we propose two tasks: (1) classifying images as real or generated, and (2) identifying which model produced a given synthetic image. The dataset is available at https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Image_Dataset.

2512.24331 2026-05-26 cs.CV

Spatial-aware Vision Language Model for Autonomous Driving

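Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

Affiliations: Motional; University of Amsterdam

AI summary: Proposes LVLDrive, a framework that fuses LiDAR point clouds into a vision-language model using a Gradual Fusion Q-Former and a spatial-aware question-answering dataset, addressing the bottleneck in 3D metric spatial reasoning and improving scene understanding and decision reliability for autonomous driving.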

Comments Accepted to CVPR AutoPilot Workshop 2026

Abstract

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

2512.24075 2026-05-26 cs.LG

Evolutionary Physics-Informed Temporal Fusion for Lane-Change Intention Prediction

Jiazhao Shi, Qiyang Xie, Ziyu Wang, Dongxu Zhang, Yichen Lin, Di Zhu, Chen Xie, Ziwei Wang, Haoyun Zhang, Enliang Li, Zetong Guan

Affiliations: Tandon School of Engineering, New York University; Khoury College of Computer Science, Northeastern University; School of Business, Wake Forest University; Independent Researcher; University of Massachusetts Amherst; Carnegie Mellon University; University of Pennsylvania; Qualcomm CDMA Technologies; University of Michigan

AI summary: Proposes an evolutionary physics-informed temporal fusion framework that fuses temporal descriptors derived from conventional traffic signals with temporal embeddings learned from raw trajectory sequences for three-class lane-change intention prediction, achieving high F1-scores on the highD and exiD datasets.

Abstract

Early lane-change intention prediction is essential for autonomous driving and ADAS, but it remains challenging because lane-changing behavior depends on evolving traffic risk, surrounding-vehicle interactions, and target-lane feasibility rather than only instantaneous vehicle states. This study proposes an evolutionary physics-informed temporal fusion framework for three-class lane-change intention prediction, including left lane change, right lane change, and no lane change. Instead of using static physics-informed variables alone, the proposed method derives temporal descriptors from conventional traffic signals, including risk evolution, gap persistence, counterfactual lane utility, interaction pressure gradient, maneuver feasibility, and intent consistency. These descriptors are fused with temporal embeddings learned from raw trajectory sequences through a sequence encoder, and the fused representation is used for final classification. Experiments are conducted on the highD and exiD datasets under 1 s, 2 s, and 3 s prediction horizons. The proposed model achieves Macro F1-scores of 0.9514, 0.9256, and 0.8872 on highD, and 0.9386, 0.9070, and 0.8531 on exiD, respectively. The improvement is especially pronounced in exiD ramp-adjacent scenarios, indicating that temporal physical evolution is particularly useful in interaction-rich environments. These results demonstrate that combining evolutionary physics-informed descriptors with learned temporal representations provides a more dynamic and interpretable solution for early lane-change intention prediction.
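The reported metric, macro F1, is the unweighted mean of the per-class F1 scores, so the rarer left and right lane-change classes count as much as the dominant no-change class. A minimal sketch follows; the function name and the toy labels are illustrative assumptions and are unrelated to the paper's data.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    scores = []
    for c in classes:
        # Count true positives, false positives, and false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for the three intention classes.
y_true = ["left", "left", "right", "none", "none", "none"]
y_pred = ["left", "none", "right", "none", "none", "right"]
score = macro_f1(y_true, y_pred, ["left", "right", "none"])
```

In this toy case each class has an F1 of 2/3, so the macro average is also 2/3.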

2512.23076 2026-05-26 cs.LG cs.AI cs.HC

Multimodal Functional Maximum Correlation for Emotion Recognition

Deyang Zheng, Tianyi Zhang, Wenming Zheng, Shujian Yu

Affiliations: Key Laboratory of Child Development and Learning Science (Ministry of Education), School of Biological Sciences and Medical Engineering, Southeast University; Department of Artificial Intelligence, Westlake University; Department of Artificial Intelligence, Vrije Universiteit Amsterdam

AI summary: Proposes the Multimodal Functional Maximum Correlation (MFMC) framework, which maximizes higher-order multimodal dependence through a Dual Total Correlation objective and achieves state-of-the-art or competitive performance on emotion recognition benchmarks.

Comments manuscript accepted by IEEE Transactions on Affective Computing. Code is available at https://github.com/DY9910/MFMC

Abstract

Emotional states manifest as coordinated yet heterogeneous physiological responses across central and autonomic systems, posing a fundamental challenge for multimodal representation learning in affective computing. Learning such joint dynamics is further complicated by the scarcity and subjectivity of affective annotations, which motivates the use of self-supervised learning (SSL). However, most existing SSL approaches rely on pairwise alignment objectives, which are insufficient to characterize dependencies among more than two modalities and fail to capture higher-order interactions arising from coordinated brain and autonomic responses. To address this limitation, we propose Multimodal Functional Maximum Correlation (MFMC), a principled SSL framework that maximizes higher-order multimodal dependence through a Dual Total Correlation (DTC) objective. By deriving a tight sandwich bound and optimizing it using a functional maximum correlation analysis (FMCA) based trace surrogate, MFMC captures joint multimodal interactions directly, without relying on pairwise contrastive losses. Experiments on three public affective computing benchmarks demonstrate that MFMC consistently achieves state-of-the-art or competitive performance under both subject-dependent and subject-independent evaluation protocols, highlighting its robustness to inter-subject variability. In particular, MFMC improves subject-dependent accuracy on CEAP-360VR from 78.9% to 86.8%, and subject-independent accuracy from 27.5% to 33.1% using the EDA signal alone. Moreover, MFMC remains within 0.8 percentage points of the best-performing method on the most challenging EEG subject-independent split of MAHNOB-HCI. Our code is available at https://github.com/DY9910/MFMC.
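For readers unfamiliar with the quantity being maximized, the standard information-theoretic definition of dual total correlation is given below, alongside total correlation for comparison. This is the textbook definition only; the paper's sandwich bound and FMCA-based trace surrogate are not reproduced here, and the symbols $X_1, \dots, X_n$ (one random variable per modality) are notation assumed for this note.

```latex
% Dual total correlation of n modalities X_1, ..., X_n:
% joint entropy minus the entropy each variable retains given all the others.
\mathrm{DTC}(X_1, \dots, X_n)
  = H(X_1, \dots, X_n) - \sum_{i=1}^{n} H\bigl(X_i \mid X_{\setminus i}\bigr)

% Total correlation, for comparison:
\mathrm{TC}(X_1, \dots, X_n)
  = \sum_{i=1}^{n} H(X_i) - H(X_1, \dots, X_n)

% For n = 2 both reduce to the mutual information:
\mathrm{DTC}(X_1, X_2) = \mathrm{TC}(X_1, X_2) = I(X_1; X_2)
```

Both measures generalize mutual information to more than two variables, which is why an objective built on them is not restricted to pairwise alignment.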

2512.18735 2026-05-26 cs.CV cs.AI

$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

Affiliations: Zhejiang University, China; Shanghai AI Lab, China; Hangzhou Normal University, China

AI summary: Proposes the $M^3-Verse$ benchmark, which uses multi-view video pairs to evaluate how well LMMs understand dynamic object changes within a consistent space, and shows the limitations of existing models.

Abstract

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

2512.15605 2026-05-26 cs.LG stat.ML

Autoregressive Language Models are Secretly Energy-Based Models: Insights into the Lookahead Capabilities of Next-Token Prediction

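Mathieu Blondel, Michael E. Sander, Germain Vivier-Ardisson, Tianlin Liu, Vincent Roulet

Affiliations: Google DeepMind

AI summary: By establishing a bijection between autoregressive models and energy-based models, the paper shows why autoregressive models have lookahead capabilities under the next-token prediction paradigm and provides theoretical error bounds.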

Abstract

Autoregressive models (ARMs) currently constitute the dominant paradigm for large language models (LLMs). Energy-based models (EBMs) represent another class of models, which have historically been less prevalent in LLM development, yet naturally characterize the optimal policy in post-training alignment. In this paper, we provide a unified view of these two model classes. Taking the chain rule of probability as a starting point, we establish an explicit bijection between ARMs and EBMs in function space, which we show to correspond to a special case of the soft Bellman equation in maximum entropy reinforcement learning. Building upon this bijection, we derive the equivalence between supervised learning of ARMs and EBMs. Furthermore, we analyze the distillation of EBMs into ARMs by providing theoretical error bounds. Our results provide insights into the ability of ARMs to plan ahead, despite being based on the next-token prediction paradigm.
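The link between the chain rule and the soft Bellman equation can be checked numerically on a toy problem. The sketch below is not taken from the paper: the vocabulary, sequence length, random reward table, and all function names are assumptions. It defines a sequence-level energy-based model whose score is a sum of per-step rewards, runs the soft Bellman recursion backward to obtain next-token logits, and confirms that the resulting autoregressive model assigns every sequence the same probability.

```python
import itertools
import math
import random

VOCAB = (0, 1, 2)   # toy vocabulary (assumption)
LENGTH = 3          # fixed sequence length (assumption)

random.seed(0)
_reward_table = {}

def reward(prefix, token):
    """Arbitrary per-step reward r(prefix, token), filled with random toy values."""
    key = (prefix, token)
    if key not in _reward_table:
        _reward_table[key] = random.uniform(-1.0, 1.0)
    return _reward_table[key]

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def soft_value(prefix):
    """V(prefix) = log sum_a exp(q(prefix, a)); zero once the sequence is complete."""
    if len(prefix) == LENGTH:
        return 0.0
    return logsumexp([q_value(prefix, a) for a in VOCAB])

def q_value(prefix, token):
    """Soft Bellman recursion: immediate reward plus soft value of the next state."""
    return reward(prefix, token) + soft_value(prefix + (token,))

def arm_prob(seq):
    """Autoregressive probability: product over positions of softmax(q)."""
    p = 1.0
    for t in range(LENGTH):
        prefix = seq[:t]
        p *= math.exp(q_value(prefix, seq[t]) - soft_value(prefix))
    return p

def ebm_prob(seq):
    """Energy-based probability: exp(total reward) divided by the partition function."""
    def total(s):
        return sum(reward(s[:t], s[t]) for t in range(LENGTH))
    z = sum(math.exp(total(s)) for s in itertools.product(VOCAB, repeat=LENGTH))
    return math.exp(total(seq)) / z

all_sequences = list(itertools.product(VOCAB, repeat=LENGTH))
max_gap = max(abs(arm_prob(s) - ebm_prob(s)) for s in all_sequences)
```

The next-token logit `q_value` already contains the soft value of everything that can follow, which is one concrete way to read the abstract's remark about lookahead in next-token prediction.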

2512.13597 2026-05-26 cs.CV

Lighting in Motion: Spatiotemporal HDR Lighting Estimation

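Christophe Bolduc, Julien Philip, Li Ma, Mingming He, Paul Debevec, Jean-François Lalonde

Affiliations: Université Laval; Eyeline Labs

AI summary: Proposes LiMo, a diffusion-based spatiotemporal lighting estimation method that generates mirrored and diffuse spheres at different exposures, conditioned on depth and geometry, to achieve accurate high-frequency detail prediction and illuminance estimation.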

Abstract

We present Lighting in Motion (LiMo), a diffusion-based approach to spatiotemporal lighting estimation. LiMo targets both realistic high-frequency detail prediction and accurate illuminance estimation. To account for both, we propose generating a set of mirrored and diffuse spheres at different exposures, based on their 3D positions in the input. Making use of diffusion priors, we fine-tune powerful existing diffusion models on a large-scale customized dataset of indoor and outdoor scenes, paired with spatiotemporal light probes. For accurate spatial conditioning, we demonstrate that depth alone is insufficient and we introduce a new geometric condition to provide the relative position of the scene to the target 3D position. Finally, we combine diffuse and mirror predictions at different exposures into a single HDRI map leveraging differentiable rendering. We thoroughly evaluate our method and design choices to establish LiMo as state-of-the-art for both spatial control and prediction accuracy.

2512.12425 2026-05-26 cs.CV

Boosting Monocular Metric Depth Estimation via Bokeh Rendering

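Hangwei Zhang, Armando Fortes, Tianyi Wei, Xingang Pan

Affiliations: S-Lab, Nanyang Technological University; Beihang University

AI summary: Proposes BokehDepth, a two-stage framework in which a physically grounded generative model produces calibrated bokeh stacks as a supervision-free geometric signal and a defocus-aware aggregation module improves the metric accuracy of monocular depth estimation.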

Comments Project Page: https://fogradio.github.io/BokehDepth_Project/

Journal ref ICML 2026

Abstract

Bokeh rendering and depth estimation share a fundamental optical connection, yet existing methods fail to fully exploit this reciprocity. Conventional bokeh pipelines rely heavily on noisy depth maps that inevitably introduce visual artifacts. Conversely, existing monocular depth models typically follow two flawed paradigms. Generative diffusion-based frameworks often lack consistent metric scale. Meanwhile, feed-forward metric depth models frequently fail in textureless or distant regions where defocus blur can provide geometric information. We propose BokehDepth, a two-stage framework that treats synthetic defocus as a supervision-free geometric signal. In the first stage, a physically grounded generative model produces calibrated bokeh stacks from a single sharp input without requiring prior depth input. Subsequently, a lightweight defocus-aware aggregation module integrates these stacks into the encoder of a depth estimation framework. This mechanism allows the model to extract consistent geometric features from the defocus dimension while keeping the decoder architecture unchanged. Experiments demonstrate that BokehDepth achieves superior visual bokeh fidelity compared to depth-dependent rendering baselines and consistently enhances the metric accuracy of state-of-the-art monocular depth models.

2512.05865 2026-05-26 cs.LG cs.AI

Intrinsically Interpretable Attention via Sparse Post-Training

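Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf

Affiliations: MPI-IS; University of Oxford; ETH Zürich

AI summary: Proposes a post-training method that applies flexible sparsity regularisation under a constrained-loss objective to reduce transformer attention connectivity to about 0.4% of its edges without sacrificing performance, simplifying global circuits and improving interpretability.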

Abstract

We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4 \%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention can be made orders of magnitude sparser, suggesting that much of its computation is redundant and that sparsity may serve as a guiding principle for more structured and interpretable models.

2511.20236 2026-05-26 cs.AI cs.LG

Actionable and diverse counterfactual explanations incorporating domain knowledge and plausibility constraints

Szymon Bobek, Łukasz Bałec, Grzegorz J. Nalepa

Affiliations: Faculty of Physics, Astronomy and Applied Computer Science, Institute of Applied Computer Science, Jagiellonian Human-Centered AI Lab

AI summary: Proposes DANCE, a method that models feature dependencies and domain constraints to generate actionable, diverse counterfactual explanations, and validates its effectiveness and practicality on OpenML datasets and in an industrial email marketing setting.

Abstract

Counterfactual explanations improve the actionable interpretability of machine learning models by identifying minimal changes required to achieve a desired outcome. However, existing methods often neglect dependencies among features, which can lead to unrealistic or impractical modifications. This limitation reduces the usefulness of counterfactual explanations in real-world decision-support systems. Motivated by applications in cybersecurity for email marketing, we propose DANCE (Diverse, Actionable, and Knowledge-Constrained Explanations), a method for generating counterfactuals that incorporate feature dependencies and domain constraints. DANCE models relationships between features using linear and probabilistic structures that can be learned from data or specified by experts. These dependencies are enforced during the search process to improve plausibility and feasibility. The method jointly optimizes plausibility, diversity, proximity, and sparsity within a unified objective. We evaluate DANCE on 140 datasets from OpenML and demonstrate that it achieves competitive or superior performance compared to existing approaches across multiple evaluation criteria. Additionally, we validate the method in a real-world industrial setting in collaboration with an email marketing platform, showing that it produces domain-consistent and actionable recommendations.

2511.19065 2026-05-26 cs.CV cs.AI cs.LG

Understanding, Accelerating, and Improving MeanFlow Training

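Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

Affiliations: Yonsei University; ETH Zurich; University of Zurich; Max Planck ETH CLS; Google

AI summary: By analyzing the interaction between instantaneous and average velocities, the paper proposes an effective training scheme that accelerates the formation of instantaneous velocity and then gradually shifts the training emphasis, yielding faster convergence and better few-step generation.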

Abstract

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
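The two quantities the abstract contrasts can be written down explicitly. The relations below follow the original MeanFlow formulation, which this abstract does not restate, so the notation ($z_t$ for the state at time $t$, $v$ for the instantaneous velocity, $u$ for the average velocity over the interval $[r, t]$) is an assumption of this note.

```latex
% Average velocity over [r, t] as the time-average of the instantaneous velocity:
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% As the temporal gap shrinks, the average velocity recovers the instantaneous one:
\lim_{r \to t} u(z_t, r, t) = v(z_t, t)

% A single large-gap evaluation moves the state across the whole interval,
% which is what one-step (1-NFE) generation relies on:
z_r = z_t - (t - r)\, u(z_t, r, t)
```

Read this way, finding (i) is natural: the small-gap limit of $u$ is $v$ itself, so an inaccurate $v$ leaves the large-gap averages without a reliable foundation.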

2511.03548 2026-05-26 cs.LG

Flat Minima and Generalization: Insights from Stochastic Convex Optimization

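Matan Schliserman, Shira Vansover-Hager, Tomer Koren

Affiliations: Blavatnik School of Computer Science and AI, Tel Aviv University; Google Research

AI summary: Studies the relationship between flat minima and generalization in stochastic convex optimization, showing that flat empirical minima may incur $\Omega(1)$ population risk while sharp minima generalize optimally, and proving that two sharpness-aware algorithms (SA-GD and SAM) may also generalize poorly.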

Abstract

Understanding the generalization behavior of learning algorithms is a central goal of learning theory. A recently emerging explanation is that learning algorithms are successful in practice because they converge to flat minima, which have been consistently associated with improved generalization performance. In this work, we study the link between flat minima and generalization in the canonical setting of stochastic convex optimization with a non-negative, $\beta$-smooth objective. Our first finding is that, even in this fundamental and well-studied setting, flat empirical minima may incur trivial $\Omega(1)$ population risk while sharp minima generalize optimally. Then, we show that this poor generalization behavior extends to two natural "sharpness-aware" algorithms originally proposed by Foret et al. (2021), designed to bias optimization toward flat solutions: Sharpness-Aware Gradient Descent (SA-GD) and Sharpness-Aware Minimization (SAM). For SA-GD, which performs gradient steps on the maximal loss in a predefined neighborhood, we prove that while it successfully converges to a flat minimum at a fast rate, the population risk of the solution can still be as large as $\Omega(1)$, indicating that even flat minima found algorithmically using a sharpness-aware gradient method might generalize poorly. For SAM, a computationally efficient approximation of SA-GD based on normalized ascent steps, we show that although it minimizes the empirical loss, it may converge to a sharp minimum and also incur population risk $\Omega(1)$. Finally, we establish population risk upper bounds for both SA-GD and SAM using algorithmic stability techniques.
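The normalized ascent step that distinguishes SAM from plain gradient descent is short enough to sketch. The update below follows the commonly cited SAM rule from Foret et al. (2021); the function name, the step size `lr`, the radius `rho`, and the handling of a zero gradient are assumptions chosen for illustration, not settings from this paper.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: ascend by rho along the normalized gradient, then descend
    using the gradient evaluated at the perturbed point (hypothetical helper)."""
    g = grad_fn(w)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        # Stationary point: the normalized ascent direction is undefined, so stay put.
        return w
    # Normalized ascent step of length rho (first-order worst-case perturbation).
    w_adv = w + rho * g / norm
    # Descent step on the original weights with the perturbed-point gradient.
    return w - lr * grad_fn(w_adv)

# Toy convex quadratic L(w) = 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([1.0, 10.0])
loss = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w

w0 = np.array([1.0, 1.0])
w1 = sam_step(w0, grad)
```

SA-GD, as described in the abstract, instead takes gradient steps on the exact maximal loss within the neighborhood; the single normalized ascent step above is the cheaper approximation that SAM substitutes for that inner maximization.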

2511.03529 2026-05-26 cs.LG

Byzantine-Robust Federated Learning with Learnable Aggregation Weights

具有可学习聚合权重的拜占庭鲁棒联邦学习

Javad Parsa, Amir Hossein Daghestani, André M. H. Teixeira, Mikael Johansson

发表机构 * Uppsala University, Sweden(瑞典乌普萨拉大学) KTH, Sweden(瑞典皇家理工学院)

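AI summary Formulates a Byzantine-robust federated learning optimization problem in which the aggregation weights are learnable parameters optimized jointly with the global model, and develops an alternating minimization algorithm that outperforms existing methods under heterogeneous data and malicious clients.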

Comments ICLR 2026

AI中文摘要

联邦学习(FL)使客户端能够在不共享私有数据的情况下协作训练全局模型。然而,恶意(拜占庭)客户端的存在对FL的鲁棒性构成了重大挑战,尤其是在客户端数据分布异构的情况下。在本文中,我们提出了一种新颖的拜占庭鲁棒FL优化问题,该问题将自适应加权引入聚合过程。与传统方法不同,我们的公式将聚合权重视为可学习参数,与全局模型参数联合优化。为了解决这个优化问题,我们开发了一种交替最小化算法,在对抗攻击下具有强收敛保证。我们分析了所提目标的拜占庭弹性。我们在各种数据集和攻击场景下,将我们的算法与最先进的拜占庭鲁棒FL方法进行了性能评估。实验结果表明,我们的方法始终优于现有方法,特别是在数据高度异构且恶意客户端比例较大的情况下。

英文摘要

Federated Learning (FL) enables clients to collaboratively train a global model without sharing their private data. However, the presence of malicious (Byzantine) clients poses significant challenges to the robustness of FL, particularly when data distributions across clients are heterogeneous. In this paper, we propose a novel Byzantine-robust FL optimization problem that incorporates adaptive weighting into the aggregation process. Unlike conventional approaches, our formulation treats aggregation weights as learnable parameters, jointly optimizing them alongside the global model parameters. To solve this optimization problem, we develop an alternating minimization algorithm with strong convergence guarantees under adversarial attack. We analyze the Byzantine resilience of the proposed objective. We evaluate the performance of our algorithm against state-of-the-art Byzantine-robust FL approaches across various datasets and attack scenarios. Experimental results demonstrate that our method consistently outperforms existing approaches, particularly in settings with highly heterogeneous data and a large proportion of malicious clients.

2510.23008 2026-05-26 cs.AI

From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports -- with Preliminary Extension to Lung Cancer

从提示优化到多维可信度评估:增强中文LLM生成的肝脏MRI报告的可信度——初步扩展至肺癌

Qiuli Wang, Xinhuang Sun, Yonglin Chen, Jie Cheng, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen

发表机构 * Yu-Yue Pathology Research Center, Jinfeng Laboratory, Chongqing, China(渝粤病理研究中心,金凤实验室,重庆,中国) 7T Magnetic Resonance Imaging Translational Medical Center, Department of Radiology, Southwest Hospital, Army Medical University, Chongqing, China(7T磁共振成像转化医学中心,放射科,西南医院,陆军军医大学,重庆,中国)

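AI summary This study proposes a Multi-Dimensional Credibility Assessment (MDCA) framework and gives guidance on institution-specific prompt optimization to enhance the trustworthiness of LLM-generated liver MRI reports, with a preliminary extension to lung cancer.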

Comments 10 pages, 6 figures, 4 tables

AI中文摘要

大型语言模型(LLMs)在从影像学发现生成诊断结论方面展现出有前景的性能,从而支持放射学报告、实习生教育和质量控制。然而,关于如何在不同临床背景下优化提示设计的系统指导仍未被充分探索。此外,评估LLM生成的放射学报告可信度的全面且标准化的框架尚未建立。本研究旨在通过引入多维可信度评估(MDCA)框架并提供机构特定提示优化的指导,增强LLM生成的肝脏MRI报告的可信度。所提出的框架被应用于评估和比较几个先进LLM的性能,包括Kimi-K2-Instruct-0905、Qwen3-235B-A22B-Instruct-2507、DeepSeek-V3和ByteDance-Seed-OSS-36B-Instruct,使用SiliconFlow平台。

英文摘要

Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.

2510.22874 2026-05-26 cs.CL

A Comprehensive Dataset for Human vs. AI Generated Text Detection

人类与AI生成文本检测的综合数据集

Rajarshi Roy, Gurpreet Singh, Ashhar Aziz, Shashwat Bajpai, Nasrin Imanpour, Shwetangshu Biswas, Kapil Wanaskar, Parth Patwa, Subhankar Ghosh, Shreyas Dixit, Nilesh Ranjan Pal, Vipula Rawte, Ritvik Garimella, Gaytri Jena, Amitava Das, Amit Sheth, Vasu Sharma, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha

发表机构 * Kalyani Government Engineering College(卡利亚尼政府工程学院) IIIT Guwahati(印度信息技术学院古瓦哈提分校) IIIT Delhi(德里英德拉普拉斯塔信息技术学院) BITS Pilani Hyderabad Campus(比尔拉理工学院皮拉尼分校海得拉巴校区) University of South Carolina(南卡罗来纳大学) NIT Silchar(锡尔杰尔国立理工学院) San José State University(圣何塞州立大学) UCLA(加州大学洛杉矶分校) Washington State University(华盛顿州立大学) Vishwakarma Institute of Information Technology(维什瓦卡玛信息技术学院) Gandhi Institute for Technological Advancement(甘地技术进步学院) BITS Pilani Goa(比尔拉理工学院皮拉尼分校果阿校区) Meta AI Amazon AI(亚马逊AI)

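AI summary Presents a comprehensive dataset of 73,193 text samples that pairs authentic New York Times articles with synthetic text generated by several state-of-the-art LLMs, for human-vs-AI text detection and model attribution; the baseline accuracies are 58.35% and 8.92%, respectively.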

Comments Defactify4 @AAAI 2025

AI中文摘要

大型语言模型(LLM)的快速发展使得AI生成的文本越来越像人类,引发了对内容真实性、错误信息和可信度的担忧。要可靠地检测AI生成文本并将其归因于特定模型,需要大规模、多样化且标注良好的数据集。在这项工作中,我们提出了一个包含73,193个文本样本的综合数据集,该数据集结合了真实的纽约时报文章与多个最先进LLM(包括Gemma-2-9b、Mistral-7B、Qwen-2-72B、LLaMA-8B、Yi-Large和GPT-4-o)生成的合成版本。数据集提供原始文章摘要作为提示,以及完整的人类作者叙述。我们为两个关键任务建立了基线结果:区分人类撰写与AI生成的文本,准确率达到58.35%;以及将AI文本归因于其生成模型,准确率为8.92%。通过将现实世界的新闻内容与现代生成模型相结合,该数据集旨在促进鲁棒的检测和归因方法的发展,在生成式AI时代培养信任和透明度。我们的数据集可在以下网址获取:https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

英文摘要

The rapid advancement of large language models (LLMs) has led to increasingly human-like AI-generated text, raising concerns about content authenticity, misinformation, and trustworthiness. Addressing the challenge of reliably detecting AI-generated text and attributing it to specific models requires large-scale, diverse, and well-annotated datasets. In this work, we present a comprehensive dataset comprising 73,193 text samples that combine authentic New York Times articles with synthetic versions generated by multiple state-of-the-art LLMs including Gemma-2-9b, Mistral-7B, Qwen-2-72B, LLaMA-8B, Yi-Large, and GPT-4-o. The dataset provides the original article abstracts as prompts, together with the full human-authored narratives. We establish baseline results for two key tasks: distinguishing human-written from AI-generated text, achieving an accuracy of 58.35%, and attributing AI texts to their generating models with an accuracy of 8.92%. By bridging real-world journalistic content with modern generative models, the dataset aims to catalyze the development of robust detection and attribution methods, fostering trust and transparency in the era of generative AI. Our dataset is available at: https://huggingface.co/datasets/Rajarshi-Roy-research/Defactify_Text_Dataset

2510.22827 2026-05-26 cs.CV cs.LG

FairJudge: Abstention-Aware Multimodal Judges for Fairness and Alignment Evaluation in Text-to-Image Models

FairJudge: 文本到图像模型中公平性与对齐评估的弃权感知多模态裁判

Zahraa Al Sahili, Maimuna Nowaz, Maryam Fetanat, Ioannis Patras, Matthew Purver

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Institut Jožef Stefan(乔泽夫·斯蒂芬研究所) Imperial College London(伦敦帝国学院)

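AI summary Proposes the FairJudge protocol, which uses multimodal LLMs as structured judges with closed label sets, an abstention mechanism, and evidence reporting to evaluate social-attribute prediction, profession grounding, and prompt-image alignment in text-to-image models.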

AI中文摘要

评估文本到图像(T2I)系统不仅需要判断图像是否匹配提示,还需要判断社会显著属性是否被忠实表示且没有无根据的推断。现有的自动评估器通常依赖于以面部为中心的识别器或对比图像-文本相似度,这些方法提供的诊断反馈有限,并且通常在视觉证据模糊或缺失时强制进行预测。对于宗教和残疾等公平敏感属性,其中线索可能是上下文相关的、间接的或故意未指定的,这些评估器可能会遗漏细心的人类评审员会注意到的失败模式。我们引入了\textsc{FairJudge},一种弃权感知的评估协议,该协议使用遵循指令的多模态LLM作为社会属性预测、职业定位和提示-图像对齐的结构化裁判。该协议将输出限制为封闭标签集,要求可见证据的理由,在线索不足时支持明确的\textsc{unspecified}决策,并将基于量规的对齐判断映射到$[-1,1]$。这些约束将MLLM裁判从开放式评估转变为可解析、可审计的评估程序。在四个属性预测基准和三个职业/对齐基准上,\textsc{FairJudge}优于或补充了CLIP、DeepFace、VIEScore和VQAScore。消融实验表明,封闭标签、弃权和证据报告对可靠性至关重要。我们进一步引入了\textsc{DIVERSIFY}和\textsc{DIVERSIFY-Professions},这两个资源丰富的上下文数据集用于评估超越面部可见或图标线索的社会表示和职业定位。我们发布了代码、提示、数据集、解析器日志和每张图像的裁判输出,以支持可重复的审计。

英文摘要

Evaluating text-to-image (T2I) systems requires judging not only whether an image matches a prompt, but also whether socially salient attributes are represented faithfully and without unsupported inference. Existing automated evaluators typically rely on face-centric recognizers or contrastive image-text similarity, which provide limited diagnostic feedback and often force predictions even when visual evidence is ambiguous or absent. For fairness-sensitive attributes such as religion and disability, where cues may be contextual, indirect, or intentionally unspecified, these evaluators can therefore miss failure modes that careful human reviewers would notice. We introduce FairJudge, an abstention-aware evaluation protocol that uses instruction-following multimodal LLMs as structured judges for social-attribute prediction, profession grounding, and prompt-image alignment. The protocol constrains outputs to closed label sets, requires visible-evidence rationales, supports an explicit "unspecified" decision when cues are insufficient, and maps rubric-based alignment judgments to $[-1,1]$. These constraints turn MLLM judging from open-ended assessment into a parseable, auditable evaluation procedure. Across four attribute-prediction benchmarks and three profession/alignment benchmarks, FairJudge outperforms or complements CLIP, DeepFace, VIEScore, and VQAScore. Ablations show that closed labels, abstention, and evidence reporting are central to reliability. We further introduce DIVERSIFY and DIVERSIFY-Professions, two context-rich resources for evaluating social representation and profession grounding beyond face-visible or iconic cues. We release code, prompts, datasets, parser logs, and per-image judge outputs to support reproducible auditing.

2510.22186 2026-05-26 cs.LG cs.IT math.FA math.IT math.MG

Quantitative Bounds for Sorting-Based Permutation-Invariant Embeddings

基于排序的置换不变嵌入的定量界

Nadav Dym, Matthias Wellershoff, Efstratios Tsoukanis, Daniel Levy, Radu Balan

发表机构 * Department of Mathematics, University of Maryland(马里兰大学数学系) Institute of Mathematical Sciences, Claremont Graduate University(克莱蒙特研究生大学数学科学研究所)

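AI summary Studies permutation-invariant embeddings obtained by sorting independent one-dimensional projections, improves the upper and lower bounds on the embedding dimension required for injectivity, and estimates the bi-Lipschitz constants, with a distortion that depends quadratically on the number of points n and is independent of the dimension d.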

Comments Minor revision; 37 pages, 1 figure, 2 tables

Journal ref IEEE Trans. Inf. Theory, vol. 72, no. 6, pp. 4297-4311, Jun. 2026

AI中文摘要

我们研究$d$维点集的置换不变嵌入,这些嵌入通过排序输入数据的$D$个独立一维投影来定义。此类嵌入出现在图深度学习中对图节点输出应具有置换不变性的场景。先前的工作表明,对于足够大的$D$和处于一般位置的投影,该映射是单射的,并且满足双Lipschitz条件。然而,仍存在两个空白:首先,注入性所需的最优大小$D$尚不清楚;其次,映射的双Lipschitz常数估计未知。本文在解决这两个空白方面取得了实质性进展。针对第一个空白,我们改进了注入性所需嵌入维度$D$的最佳已知上界,并给出了最小注入性维度的下界。针对第二个空白,我们构造了投影向量矩阵,使得映射的双Lipschitz失真度与点数$n$的平方成正比,且完全独立于维度$d$。我们还证明,对于任何投影向量的选择,映射的失真度不会优于与$n$的平方根成比例的界。最后,我们展示了即使对映射应用线性投影以降低其维度,也能提供类似的保证。

英文摘要

We study permutation-invariant embeddings of $d$-dimensional point sets, which are defined by sorting $D$ independent one-dimensional projections of the input. Such embeddings arise in graph deep learning where outputs should be invariant to permutations of graph nodes. Previous work showed that for large enough $D$ and projections in general position, this mapping is injective, and moreover satisfies a bi-Lipschitz condition. However, two gaps remain: firstly, the optimal size $D$ required for injectivity is not yet known, and secondly, no estimates of the bi-Lipschitz constants of the mapping are known. In this paper, we make substantial progress in addressing both of these gaps. Regarding the first gap, we improve upon the best known upper bounds for the embedding dimension $D$ necessary for injectivity, and also provide a lower bound on the minimal injectivity dimension. Regarding the second gap, we construct matrices of projection vectors, so that the bi-Lipschitz distortion of the mapping depends quadratically on the number of points $n$, and is completely independent of the dimension $d$. We also show that for any choice of projection vectors, the distortion of the mapping will never be better than a bound proportional to the square root of $n$. Finally, we show that similar guarantees can be provided even when linear projections are applied to the mapping to reduce its dimension.
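The embedding described in the abstract (project each point onto $D$ directions, then sort each projected row) can be sketched as follows. The sizes `d`, `n`, `D`, the Gaussian projection vectors, and the function name `sort_embedding` are hypothetical choices for illustration; the paper's specific projection matrices and bounds are not reproduced here.

```python
import random

def sort_embedding(points, projections):
    """points: list of n vectors in R^d; projections: list of D vectors in R^d.
    Returns D sorted lists of length n, one per projection direction."""
    rows = []
    for a in projections:
        # One-dimensional projection of every point onto direction a.
        vals = [sum(ai * xi for ai, xi in zip(a, x)) for x in points]
        # Sorting discards the order in which the points were listed.
        rows.append(sorted(vals))
    return rows

rng = random.Random(0)
d, n, D = 3, 5, 7  # hypothetical sizes chosen for the example
points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
projections = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(D)]

shuffled = points[:]
rng.shuffle(shuffled)
emb = sort_embedding(points, projections)
same = emb == sort_embedding(shuffled, projections)
```

Here `same` checks permutation invariance only. Whether the map is also injective depends on how large `D` is relative to `n` and `d`, which is the question the paper's bounds address.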

2510.22143 2026-05-26 cs.CL

Benchmarking and Learning Real-World Customer Service Dialogue

基准测试与学习真实世界客服对话

Tianhong Gao, Jundong Shen, Jiapeng Wang, Bei Shi, Ying Ju, Junfeng Yao, Huiyu Yu

发表机构 * ByteDance, Beijing, China(字节跳动,北京,中国)

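AI summary To address the mismatch between industrial intelligent customer service and real-world dialogue requirements, proposes the OlaBench benchmark and the OlaMind model; by distilling expert reasoning patterns and applying staged reinforcement learning, OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench and, in online A/B tests, raises issue resolution by 23.67% and lowers the human transfer rate by 6.6%.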

AI中文摘要

现有的工业智能客服(ICS)基准和训练流程与真实对话需求仍存在偏差,过度强调可验证的任务成功,而低估了主观服务质量和实际故障模式,导致离线收益与可部署对话行为之间存在差距。我们通过一个从基准到优化的循环来弥合这一差距:首先引入OlaBench,一个涵盖检索增强生成、基于工作流的系统和智能体设置的ICS基准,评估服务能力、安全性和延迟敏感性;此外,受OlaBench结果显示最先进的LLM仍不足的启发,我们提出OlaMind,它从专家对话中提炼可复用的推理模式和服务策略,并应用分阶段探索-利用强化学习,结合实例级评分感知指导来提升模型能力。OlaMind在OlaBench上超越了GPT-5.2和Gemini 3 Pro(83.64 vs. 70.58/70.84),并且在在线A/B测试中,与基线相比,平均问题解决率提高了23.67%,人工转接率降低了6.6%,从而将离线收益桥接到部署中。OlaBench和OlaMind共同推动ICS系统向更拟人化、专业化和可靠的方向发展。项目页面和评估可在https://olamind-olabench.github.io获取。

英文摘要

Existing benchmarks and training pipelines for industrial intelligent customer service (ICS) remain misaligned with real-world dialogue requirements, overemphasizing verifiable task success while under-measuring subjective service quality and realistic failure modes, leaving a gap between offline gains and deployable dialogue behavior. We close this gap with a benchmark-to-optimization loop: we first introduce OlaBench, an ICS benchmark spanning retrieval-augmented generation, workflow-based systems, and agentic settings, which evaluates service capability, safety, and latency sensitivity; moreover, motivated by OlaBench results showing state-of-the-art LLMs still fall short, we propose OlaMind, which distills reusable reasoning patterns and service strategies from expert dialogues and applies staged exploration--exploitation reinforcement learning with instance-level rubric-aware guidance to improve model capability. OlaMind surpasses GPT-5.2 and Gemini 3 Pro on OlaBench (83.64 vs. 70.58/70.84) and, in online A/B tests, delivers an average +23.67% issue resolution and -6.6% human transfer rate versus the baseline, bridging offline gains to deployment. Together, OlaBench and OlaMind advance ICS systems toward more anthropomorphic, professional, and reliable deployment. The project page and evaluation are available at https://olamind-olabench.github.io.

2510.20390 2026-05-26 cs.RO

NeuralTouch: Neural Descriptors for Precise Sim-to-Real Tactile Robot Control

NeuralTouch: 用于精确的仿真到现实触觉机器人控制的神经描述符

Yijiong Lin, Bowen Deng, Keju Pu, Chenghua Lu, Max Yang, Efi Psomopoulou, Nathan F. Lepora

发表机构 * Department of Engineering Mathematics and Bristol Robotics Laboratory(工程数学系和布里斯托尔机器人实验室)

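AI summary Proposes NeuralTouch, a multimodal framework that combines Neural Descriptor Fields (NDF) with tactile sensing and uses a deep reinforcement learning policy driven by tactile feedback to refine grasp poses, enabling precise and generalizable robotic manipulation.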

Journal ref IEEE/ASME Transactions on Mechatronics, 2026

AI中文摘要

抓取精度是精确物体操作的关键前提,通常需要机器人手与物体之间的仔细对齐。神经描述符场(NDF)提供了一种有前景的基于视觉的方法,能够生成跨物体类别泛化的抓取姿态。然而,由于相机标定不完美、点云不完整以及物体变异性,仅靠NDF可能产生不准确的姿态。同时,触觉感知能够实现更精确的接触,但现有方法通常学习仅限于简单、预定义接触几何的策略。在这项工作中,我们引入了NeuralTouch,一个集成NDF和触觉感知的多模态框架,通过轻柔的物理交互实现精确、可泛化的抓取。我们的方法利用NDF隐式表示目标接触几何,从中训练深度强化学习(RL)策略,利用触觉反馈来优化抓取。该策略以神经描述符为条件,不需要显式指定接触类型。我们通过仿真中的消融研究以及零样本迁移到真实世界的操作任务(如销钉出孔和瓶盖打开)来验证NeuralTouch,无需额外微调。结果表明,NeuralTouch在抓取精度和鲁棒性上显著优于基线方法,为精确、富接触的机器人操作提供了一个通用框架。

英文摘要

Grasping accuracy is a critical prerequisite for precise object manipulation, often requiring careful alignment between the robot hand and object. Neural Descriptor Fields (NDF) offer a promising vision-based method to generate grasping poses that generalize across object categories. However, NDF alone can produce inaccurate poses due to imperfect camera calibration, incomplete point clouds, and object variability. Meanwhile, tactile sensing enables more precise contact, but existing approaches typically learn policies limited to simple, predefined contact geometries. In this work, we introduce NeuralTouch, a multimodal framework that integrates NDF and tactile sensing to enable accurate, generalizable grasping through gentle physical interaction. Our approach leverages NDF to implicitly represent the target contact geometry, from which a deep reinforcement learning (RL) policy is trained to refine the grasp using tactile feedback. This policy is conditioned on the neural descriptors and does not require explicit specification of contact types. We validate NeuralTouch through ablation studies in simulation and zero-shot transfer to real-world manipulation tasks--such as peg-out-in-hole and bottle lid opening--without additional fine-tuning. Results show that NeuralTouch significantly improves grasping accuracy and robustness over baseline methods, offering a general framework for precise, contact-rich robotic manipulation.

2510.15264 2026-05-26 cs.CV

DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion

DriveGen3D: 通过高效视频扩散提升前馈驾驶场景生成

Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Yicheng Xiao, Donny Y. Chen, Jiwen Lu

发表机构 * Zhejiang University(浙江大学) GigaAI Tsinghua University(清华大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Peking University(北京大学) Monash University(莫纳什大学)

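AI summary Proposes the DriveGen3D framework, which combines a fast video diffusion transformer (FastDrive-DiT) with a feed-forward 3D reconstruction module (FastRecon3D) to generate high-quality, controllable dynamic 3D driving scenes, with state-of-the-art results on long videos and 3D consistency.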

Comments ICME 2026 Oral, Project Page: https://lhmd.top/drivegen3d

AI中文摘要

我们提出了DriveGen3D,一个用于生成高质量、高可控性动态3D驾驶场景的新框架,解决了现有方法的关键局限性。当前的驾驶场景合成方法要么因扩展时间生成而面临高昂的计算需求,要么专注于没有3D表示的长时间视频合成,或者局限于静态单场景重建。我们的工作通过多模态条件控制,将加速的长期视频生成与大规模动态场景重建相结合,弥合了这一方法论差距。DriveGen3D引入了一个由两个专门组件组成的统一流程:FastDrive-DiT,一个高效的视频扩散Transformer,用于在文本和鸟瞰图(BEV)布局引导下进行高分辨率、时间连贯的视频合成;以及FastRecon3D,一个前馈模块,可快速构建跨时间的3D高斯表示,确保时空一致性。DriveGen3D能够生成长达$800\times424$、12 FPS的驾驶视频及相应的3D场景,在保持效率的同时取得了最先进的结果。

英文摘要

We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. DriveGen3D enables the generation of long driving videos (up to $800\times424$ at $12$ FPS) and corresponding 3D scenes, achieving state-of-the-art results while maintaining efficiency.

2510.14862 2026-05-26 cs.CV cs.DC

Multi-modal video data-pipelines for machine learning with minimal human supervision

最小人工监督的机器学习多模态视频数据管道

Mihai-Cristian Pîrvu, Marius Leordeanu

发表机构 * Institute of Mathematics of the Romanian Academy "Simion Stoilow"(罗马尼亚科学院 "Simion Stoilow" 数学研究所) Faculty of Automatic Control and Computer Science, National University of Science and Technology POLITEHNICA(自动控制与计算机科学学院,国立科技大学POLITEHNICA)

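AI summary Proposes a fully automated data pipeline that uses pre-trained expert models and procedural combinations to fuse multiple visual modalities without human supervision, and distills the PHG-MAE model into a low-parameter (<1M) version that is competitive with models of about 300M parameters, deployed for real-time semantic segmentation and depth estimation.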

AI中文摘要

现实世界本质上是多模态的。我们的工具以数字形式(如视频或声音)观察和拍摄其快照,但大部分信息丢失。同样,对于人类之间的动作和信息传递,语言被用作书面交流形式。传统上,机器学习模型是单模态的(例如,rgb -> 语义或文本 -> 情感分类)。最近的趋势走向双模态,其中图像和文本一起学习,然而,为了真正理解世界,我们需要整合所有这些独立的模态。在这项工作中,我们尝试使用很少或没有人工监督来结合尽可能多的视觉模态。为此,我们使用预训练专家模型和它们之间的程序化组合,在原始视频之上构建一个完全自主的数据管道,我们也将其开源。然后,我们利用PHG-MAE,一个专门设计用于利用多模态数据的模型。我们展示了这个模型被高效蒸馏成低参数(<1M)后,可以与约300M参数的模型竞争。我们将该模型部署在商品硬件上的手持设备或网络摄像头上,分析实时语义分割的用例。最后,我们使用相同的框架部署其他现成模型,如用于近实时深度估计的DPT。

英文摘要

The real world is inherently multi-modal. Our tools observe it and take snapshots of it in digital form, such as videos or sounds; however, much of it is lost. Similarly, for actions and information passing between humans, languages are used as a written form of communication. Traditionally, Machine Learning models have been unimodal (e.g., rgb -> semantic or text -> sentiment_class). Recent trends go towards bi-modality, where images and text are learned together; however, in order to truly understand the world, we need to integrate all these independent modalities. In this work we try to combine as many visual modalities as we can using little to no human supervision. In order to do this, we use pre-trained experts and procedural combinations between them on top of raw videos using a fully autonomous data-pipeline, which we also open-source. We then make use of PHG-MAE, a model specifically designed to leverage multi-modal data. We show that this model, once efficiently distilled into a low-parameter (<1M) model, can achieve results competitive with models of ~300M parameters. We deploy this model and analyze the use-case of real-time semantic segmentation from handheld devices or webcams on commodity hardware. Finally, we deploy other off-the-shelf models using the same framework, such as DPT for near real-time depth estimation.

2510.11296 2026-05-26 cs.CV cs.LG

$Δ\mathrm{Energy}$: Optimizing Energy Change During Vision-Language Alignment Improves both OOD Detection and OOD Generalization

$Δ\mathrm{Energy}$: 优化视觉-语言对齐过程中的能量变化提升OOD检测与OOD泛化

Lin Zhu, Yifeng Yang, Xinbing Wang, Qinying Gu, Nanyang Ye

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

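AI summary Proposes the ΔEnergy score, based on the energy change observed when re-aligning the vision-language modalities, to improve both out-of-distribution detection and out-of-distribution generalization, and builds a unified fine-tuning framework on maximizing its lower bound (EBM).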

Comments Accepted by NeurIPS2025

AI中文摘要

近期针对视觉-语言模型(VLM)的方法在下游任务快速适应中取得了显著成功。当应用于真实世界下游任务时,VLM不可避免地会遇到分布内(ID)数据和分布外(OOD)数据。OOD数据集通常包括协变量偏移(例如,已知类别但图像风格变化)和语义偏移(例如,测试时未见类别)。这凸显了提升VLM对协变量偏移OOD数据的泛化能力,同时有效检测开放集语义偏移OOD类别的重要性。本文受重新对齐视觉-语言模态时(具体通过将最大余弦相似度直接降低到低值)观察到的闭集数据中显著能量变化的启发,提出了一种新的OOD分数,命名为ΔEnergy。ΔEnergy显著优于基于能量的原始OOD分数,为OOD检测提供了更可靠的方法。此外,ΔEnergy还能同时提升协变量偏移下的OOD泛化,这是通过ΔEnergy的下界最大化(称为EBM)实现的。理论上证明EBM不仅能增强OOD检测,还能产生领域一致的Hessian矩阵,这作为OOD泛化的强指标。基于这一发现,我们开发了一个统一的微调框架,能够提升VLM在OOD泛化和OOD检测两方面的鲁棒性。在具有挑战性的OOD检测和泛化基准上的大量实验证明了我们方法的优越性,在AUROC上比近期方法提升了10%到25%。

英文摘要

Recent approaches for vision-language models (VLMs) have shown remarkable success in achieving fast downstream adaptation. When applied to real-world downstream tasks, VLMs inevitably encounter both in-distribution (ID) data and out-of-distribution (OOD) data. The OOD datasets often include both covariate shifts (e.g., known classes with changes in image styles) and semantic shifts (e.g., test-time unseen classes). This highlights the importance of improving VLMs' generalization ability to covariate-shifted OOD data, while effectively detecting open-set semantic-shifted OOD classes. In this paper, inspired by the substantial energy change observed in closed-set data when re-aligning vision-language modalities (specifically by directly reducing the maximum cosine similarity to a low value), we introduce a novel OOD score, named ΔEnergy. ΔEnergy significantly outperforms the vanilla energy-based OOD score and provides a more reliable approach for OOD detection. Furthermore, ΔEnergy can simultaneously improve OOD generalization under covariate shifts, which is achieved by lower-bound maximization for ΔEnergy (termed EBM). EBM is theoretically proven to not only enhance OOD detection but also yield a domain-consistent Hessian, which serves as a strong indicator for OOD generalization. Based on this finding, we developed a unified fine-tuning framework that allows for improving VLMs' robustness in both OOD generalization and OOD detection. Extensive experiments on challenging OOD detection and generalization benchmarks demonstrate the superiority of our method, outperforming recent approaches by 10% to 25% in AUROC.
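One possible reading of the energy-change idea in the abstract can be sketched as below. It is only an interpretation of the one-sentence description above: the energy form $-\log\sum_i \exp(s_i/\tau)$, the temperature `tau`, the replacement value `low`, the sign convention of the difference, and the toy similarity vectors are all assumptions, and the paper's exact definition of ΔEnergy may differ.

```python
import math

def energy(sims, tau=0.01):
    """Energy score -log sum_i exp(s_i / tau), computed stably."""
    z = [s / tau for s in sims]
    m = max(z)
    return -(m + math.log(sum(math.exp(v - m) for v in z)))

def delta_energy(sims, low=0.0, tau=0.01):
    """Energy change after forcing the largest similarity down to `low`
    (an assumed stand-in for the re-alignment step in the abstract)."""
    k = max(range(len(sims)), key=lambda i: sims[i])
    realigned = [low if i == k else s for i, s in enumerate(sims)]
    return energy(realigned, tau) - energy(sims, tau)

# Hypothetical image-to-class cosine similarities.
id_like = [0.90, 0.20, 0.10]    # one class clearly dominates
ood_like = [0.30, 0.28, 0.29]   # no class stands out
d_id = delta_energy(id_like)
d_ood = delta_energy(ood_like)
```

Under these assumptions the closed-set-like input shows a much larger energy change than the flat, OOD-like input, which is the kind of gap a detection threshold could exploit.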

2510.10921 2026-05-26 cs.CV cs.AI cs.LG

FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model

FG-CLIP 2: 一种双语细粒度视觉-语言对齐模型

Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng, Yuhui Yin

发表机构 * 360 AI Research(360人工智能研究院)

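AI summary Proposes FG-CLIP 2, a bilingual vision-language model that achieves fine-grained alignment in English and Chinese through fine-grained supervision, including region-text matching, long-caption modeling, and a Textual Intra-modal Contrastive loss, and reports state-of-the-art results on 29 datasets.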

Comments Accepted in ICML2026

AI中文摘要

细粒度视觉-语言理解需要视觉内容与语言描述之间的精确对齐,这一能力在当前模型中仍然有限,尤其是在非英语环境下。虽然CLIP等模型在全局对齐上表现良好,但它们往往难以捕捉对象属性、空间关系和语言表达中的细粒度细节,且对双语理解的支持有限。为应对这些挑战,我们提出了FG-CLIP 2,一个旨在推进英语和中文细粒度对齐的双语视觉语言模型。我们的方法利用了丰富的细粒度监督,包括区域-文本匹配和长描述建模,以及多个判别性目标。我们进一步引入了文本内模态对比损失,以更好地区分语义相似的描述。在精心策划的大规模英语和中文数据混合上训练,包括新发布的1200万中文区域-文本数据集,FG-CLIP 2实现了强大的双语性能。为进行严格评估,我们提出了一个新的中文多模态理解基准,包括长描述检索和边界框分类。在8个任务的29个数据集上的大量实验表明,FG-CLIP 2优于现有方法,在两种语言上均达到了最先进的结果。我们发布了模型、代码和基准,以促进双语细粒度视觉-语言对齐的未来研究。

英文摘要

Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, including a newly released 12M Chinese region-text dataset, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained vision-language alignment.

2510.08558 2026-05-26 cs.AI cs.CL cs.IR cs.LG

Agent Learning via Early Experience

通过早期经验进行智能体学习

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) FAIR at Meta(Meta的FAIR部门) The Ohio State University(俄亥俄州立大学)

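AI summary Proposes the early experience paradigm, which uses interaction data generated by the agent's own actions (without reward signals) through two strategies, implicit world modeling and self-reflection, to improve agent effectiveness and out-of-domain generalization across diverse environments.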

Comments ICML 2026

AI中文摘要

语言智能体的一个长期目标是通过自身经验学习和改进,最终在复杂的现实任务中超越人类。然而,在缺乏可验证奖励(如网站)或需要低效长程展开(如多轮工具使用)的许多环境中,基于经验数据使用强化学习训练智能体仍然困难。因此,当前大多数智能体依赖专家数据的监督微调,这难以扩展且泛化能力差。这一局限性源于专家示范的本质:它们只捕获了狭窄的场景范围,并使智能体暴露于有限的环境多样性。我们通过一种称为早期经验的中间范式来解决这一局限性:由智能体自身动作生成的交互数据,其中产生的未来状态作为监督信号,无需奖励。在此范式下,我们研究了使用此类数据的两种策略:(1)隐式世界建模,利用收集的状态将策略基于环境动态;(2)自我反思,智能体从其次优动作中学习以改进推理和决策。在八个多样化环境和多个模型家族上的评估表明,我们的方法持续提升了有效性和跨域泛化,凸显了早期经验的价值。此外,在具有可验证奖励的环境中,我们的结果提供了有希望的信号,表明早期经验为后续强化学习奠定了坚实基础,使其成为模仿学习与完全经验驱动智能体之间的实用桥梁。

英文摘要

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios, and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies of using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluation across eight diverse environments and multiple model families shows that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.

2510.08350 2026-05-26 cs.LG cs.AI

DeepEN: A Deep Reinforcement Learning Framework for Personalized Enteral Nutrition in Critical Care

DeepEN: 一种用于重症监护中个性化肠内营养的深度强化学习框架

Daniel Jason Tan, Jiayang Chen, Dilruk Perera, Kay Choong See, Mengling Feng

发表机构 * Institute of Data Science(数据科学研究所) Saw Swee Hock School of Public Health, National University of Singapore, Singapore(Saw Swee Hock公共卫生学院,新加坡国立大学,新加坡) National University Hospital, Singapore(新加坡国立医院)

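AI summary Proposes DeepEN, a framework that uses deep reinforcement learning to learn personalized enteral nutrition regimens from electronic health records; on MIMIC-IV it lowers estimated absolute mortality by 4.0 percentage points relative to clinician practice.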

AI中文摘要

目的:由于个性化程度有限以及在动态代谢需求下对适当热量、蛋白质和液体目标的不确定性,ICU中的肠内营养(EN)输送仍不理想。我们引入DeepEN,一个使用电子健康记录数据进行个性化EN优化的强化学习(RL)框架。方法:DeepEN在来自MIMIC-IV的超过11,000名ICU患者上训练,以生成每4小时一次、针对患者的卡路里、蛋白质和液体目标。状态表示包括人口统计学、合并症、生命体征、实验室值和近期干预措施。一个生理学对齐的奖励框架平衡了生物标志物稳定性与长期生存。策略学习采用带有保守Q学习正则化的决斗双深度Q网络,以实现安全的离线训练。结果:DeepEN实现了最高的估计策略价值($V^π= 9.48$)和最低的校准死亡率(18.8 ± 1.0%),与临床实践(22.8%)相比绝对降低了4.0个百分点。该策略还表现出优越的代谢稳定性,实现了目标范围内葡萄糖、磷酸盐和钠值的最高比例。此外,偏离DeepEN策略与死亡率和生物标志物不稳定性独立相关,而偏离随机策略则没有这种关联。可解释性分析进一步表明,建议是基于器官功能和代谢状态的生理相关标志物,而不是静态剂量启发式。结论:DeepEN证明了保守离线RL在安全、个性化EN优化中的可行性,突出了数据驱动个性化在重症监护中补充基于指南方法的潜力。

英文摘要

Objective: Enteral nutrition (EN) delivery in the ICU remains suboptimal due to limited personalization and uncertainty regarding appropriate calorie, protein, and fluid targets under dynamic metabolic demands. We introduce DeepEN, a reinforcement learning (RL) framework for personalized EN optimization using electronic health record data. Methods: DeepEN was trained on over 11,000 ICU patients from MIMIC-IV to generate 4-hourly, patient-specific caloric, protein, and fluid targets. The state representation incorporated demographics, comorbidities, vital signs, laboratory values, and recent interventions. A physiologically aligned reward framework balanced biomarker stability with long-term survival. Policy learning employed a dueling double deep Q-network with Conservative Q-Learning regularization to enable safe offline training. Results: DeepEN achieved the highest estimated policy value ($V^π = 9.48$) and the lowest calibrated mortality (18.8 ± 1.0%), representing a 4.0 percentage-point absolute reduction compared with clinician practice (22.8%). The policy also demonstrated superior metabolic stability, achieving the highest proportion of glucose, phosphate, and sodium values within target range. Furthermore, deviation from the DeepEN policy was independently associated with increased mortality and biomarker instability, whereas deviation from a random policy showed no such association. Interpretability analyses further indicated that recommendations were conditioned on physiologically relevant markers of organ function and metabolic status rather than static dosing heuristics. Conclusion: DeepEN demonstrates the feasibility of conservative offline RL for safe, individualized EN optimization, highlighting the potential of data-driven personalization to complement guideline-based approaches in critical care.

2510.06672 2026-05-26 cs.LG

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

XRPO:通过定向探索与利用突破GRPO极限

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

Affiliations: University of Illinois Urbana–Champaign; Carnegie Mellon University

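AI summary: Proposes the XRPO framework, which combines an adaptive exploration allocator, an in-context seeding strategy, and a novelty-aware advantage mechanism to improve over GRPO by up to 4% pass@1 and 6% cons@32 on math and coding benchmarks, while speeding up training convergence by up to 2.7x.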

详情
AI中文摘要

GRPO等强化学习算法推动了大型语言模型推理的最新进展。虽然增加rollout数量可以稳定训练,但现有方法在具有挑战性的提示上探索有限,且由于跨提示的上下文无关rollout分配(例如,每个提示生成16个rollout)以及严重依赖稀疏奖励,导致信息性反馈信号未被充分利用。本文提出XRPO(探索-利用GRPO),这是一个统一框架,通过rollout探索-利用的原则性视角重新审视策略优化。为增强探索,XRPO引入了一个数学基础的rollout分配器,自适应地优先处理具有更高不确定性减少潜力的提示。它还通过上下文种子策略注入精选示例,解决零奖励提示上的停滞问题,引导模型进入更困难的推理轨迹。为加强利用,XRPO开发了一种组相对、新颖性感知的优势锐化机制,利用序列似然性放大低概率但正确的响应,从而将策略扩展到稀疏奖励之外。在多种数学和编码基准上对推理和非推理模型的实验表明,XRPO优于现有先进方法(如GRPO和GSPO),pass@1提升高达4%,cons@32提升高达6%,同时训练收敛速度加快达2.7倍。

English abstract

Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and heavy reliance on sparse rewards. This paper presents XRPO (eXplore-eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy's reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advanced methods (e.g., GRPO and GSPO) by up to 4% on pass@1 and 6% on cons@32, while accelerating training convergence by up to 2.7x.
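As a rough illustration of the exploitation side, the sketch below computes standard GRPO-style group-relative advantages and then applies a bonus to correct responses whose sequence log-likelihood falls below the group mean. The first function is the usual group normalization. The second is a hypothetical stand-in: the abstract does not give XRPO's sharpening formula, so the bonus form, the `lam` weight, and the function names are assumptions.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by the group mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def sharpen_with_novelty(advantages, rewards, seq_logprobs, lam=0.5):
    """Hypothetical sharpening: correct responses (reward > 0) that are less
    likely than the group average receive an extra positive bonus."""
    mean_lp = sum(seq_logprobs) / len(seq_logprobs)
    sharpened = []
    for adv, reward, lp in zip(advantages, rewards, seq_logprobs):
        bonus = lam * max(0.0, mean_lp - lp) if reward > 0 else 0.0
        sharpened.append(adv + bonus)
    return sharpened


# One prompt, four rollouts: rollouts 0 and 2 are correct.
rewards = [1.0, 0.0, 1.0, 0.0]
seq_logprobs = [-1.0, -2.0, -5.0, -2.0]  # rollout 2 is correct but unlikely
adv = group_relative_advantages(rewards)
sharp = sharpen_with_novelty(adv, rewards, seq_logprobs)

# A group where every rollout fails produces all-zero advantages,
# which is the zero-reward stagnation the abstract refers to.
stalled = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In this toy group the two correct rollouts start with the same advantage, and only the low-likelihood one is amplified, which is the qualitative behavior the abstract describes.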

2510.06351 2026-05-26 cs.RO

A Formal gatekeeper Framework for Safe Dual Control with Active Exploration

具有主动探索的安全双重控制的正式门控框架

Kaleb Ben Naveed, Devansh R. Agrawal, Dimitra Panagou

Affiliations: Department of Robotics, University of Michigan; Department of Aerospace Engineering, University of Michigan

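AI summary: Proposes a framework that integrates robust planning with active exploration, using a gatekeeper mechanism to explore only when doing so yields a verifiable improvement without sacrificing safety, thereby balancing safety against uncertainty reduction.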

Comments Accepted at American Control Conference (ACC) 2026

详情
AI中文摘要

在模型不确定性下规划安全轨迹是一个基本挑战。鲁棒规划通过考虑最坏情况来确保安全,但忽略了不确定性降低,导致过于保守的行为。在名义任务期间主动实时降低不确定性定义了双重控制问题。大多数方法通过在成本中添加加权探索项来解决这一问题,调整以平衡名义目标和不确定性降低,但没有正式考虑何时探索是有益的。此外,某些方法强制安全性,而其他方法则没有。我们提出了一个框架,将鲁棒规划与正式保证下的主动探索集成如下:关键创新和贡献在于,仅在探索提供可验证改进且不牺牲安全时才进行探索。为实现这一点,我们利用我们早期关于门控器作为安全验证架构的工作,并将其扩展,使其生成既安全又信息丰富的轨迹,从而降低不确定性和任务成本,或将其保持在用户定义的预算内。通过参数不确定性下四旋翼飞行器在线双重控制的仿真案例研究评估了该方法。

English abstract

Planning safe trajectories under model uncertainty is a fundamental challenge. Robust planning ensures safety by considering worst-case realizations, yet ignores uncertainty reduction and leads to overly conservative behavior. Actively reducing uncertainty on-the-fly during a nominal mission defines the dual control problem. Most approaches address this by adding a weighted exploration term to the cost, tuned to trade off the nominal objective and uncertainty reduction, but without formal consideration of when exploration is beneficial. Moreover, safety is enforced in some methods but not in others. We propose a framework that integrates robust planning with active exploration under formal guarantees as follows: The key innovation and contribution is that exploration is pursued only when it provides a verifiable improvement without compromising safety. To achieve this, we utilize our earlier work on gatekeeper as an architecture for safety verification, and extend it so that it generates both safe and informative trajectories that reduce uncertainty and the cost of the mission, or keep it within a user-defined budget. The methodology is evaluated via simulation case studies on the online dual control of a quadrotor under parametric uncertainty.
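The acceptance rule described in the abstract can be pictured as a filter on candidate trajectories. The sketch below is a simplified reading of that rule, not the authors' algorithm: the `Trajectory` fields, the `gatekeep` function, and the way safety, cost, and uncertainty are summarized as precomputed numbers are all assumptions. In particular, `verified_safe` stands in for a worst-case safety check that the real framework would have to perform.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    name: str
    verified_safe: bool   # outcome of a worst-case safety check (assumed given)
    mission_cost: float   # predicted cost of completing the mission
    uncertainty: float    # predicted parameter uncertainty after execution


def gatekeep(committed, candidate, cost_budget):
    """Switch to the candidate only if it is verified safe, reduces
    uncertainty, and either lowers mission cost or stays within the budget.
    Otherwise keep the already committed trajectory."""
    if not candidate.verified_safe:
        return committed
    reduces_uncertainty = candidate.uncertainty < committed.uncertainty
    cost_acceptable = (
        candidate.mission_cost <= committed.mission_cost
        or candidate.mission_cost <= cost_budget
    )
    if reduces_uncertainty and cost_acceptable:
        return candidate
    return committed


nominal = Trajectory("nominal", True, mission_cost=10.0, uncertainty=1.0)
explore = Trajectory("explore", True, mission_cost=11.0, uncertainty=0.4)
risky = Trajectory("risky", False, mission_cost=8.0, uncertainty=0.1)
costly = Trajectory("costly", True, mission_cost=15.0, uncertainty=0.2)

chosen = gatekeep(nominal, explore, cost_budget=12.0)
```

Because rejection always falls back to the committed trajectory, the filter never leaves the system without a safe plan, whatever the exploration planner proposes.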

2510.05688 2026-05-26 cs.LG cs.AI

vAttention: Verified Sparse Attention

vAttention: 验证的稀疏注意力

Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

Affiliations: Electrical Engineering and Computer Sciences, University of California, Berkeley

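AI summary: Proposes vAttention, which unifies top-k and random sampling to give the first practical sparse attention mechanism with user-specified (ε, δ) guarantees on approximation accuracy, significantly improving the quality-efficiency trade-off.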

Journal ref Proceedings of the International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

最先进的用于减少解码延迟的稀疏注意力方法主要分为两类:近似top-$k$(及其扩展top-$p$)和最近引入的基于采样的估计。然而,这些方法在逼近全注意力方面存在根本性局限:它们无法在头和查询向量之间提供一致的近似,最关键的是,缺乏对近似质量的保证,限制了其实际部署。我们观察到top-$k$和随机采样是互补的:当注意力分数由少数标记主导时,top-$k$表现良好,而当注意力分数相对均匀时,随机采样提供更好的估计。基于这一洞察并利用采样的统计保证,我们引入了vAttention,这是第一个具有用户指定$(ε, δ)$近似精度保证(因此称为“已验证”)的实用稀疏注意力机制。这些保证使vAttention成为向大规模实用、可靠部署稀疏注意力迈出的引人注目的一步。通过统一top-$k$和采样,vAttention在质量-效率权衡上优于两者各自的表现。我们的实验表明,vAttention显著提高了稀疏注意力的质量(例如,在RULER-HARD上,Llama 3.1 8B Instruct和DeepSeek-R1-Distill-Llama-8B提高了约4.5个百分点),并有效弥合了全注意力和稀疏注意力之间的差距(例如,在多个数据集上,以高达20倍稀疏度匹配全模型质量)。我们还展示了它可以部署在推理场景中,在不牺牲模型质量的情况下实现快速解码(例如,vAttention在AIME2024上以10倍稀疏度和高达32K标记生成实现了全模型质量)。代码:https://github.com/skylight-org/sparse-attention-hub。网页:https://sky-light.eecs.berkeley.edu。

English abstract

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(ε, δ)$ guarantees on approximation accuracy (thus, "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama 3.1 8B Instruct and DeepSeek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code: https://github.com/skylight-org/sparse-attention-hub. Webpage: https://sky-light.eecs.berkeley.edu.
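The complementarity of top-$k$ and random sampling can be seen in a small single-query example: the heaviest tokens are handled exactly, and the remaining mass is estimated from a uniform sample that is rescaled to the size of the residual set. This is a generic sketch of that idea, not the vAttention algorithm. The function names are hypothetical, values are scalars for brevity, all scores are computed up front (which a real sparse method would avoid), and nothing here chooses the sample size to meet an $(ε, δ)$ guarantee.

```python
import math
import random


def full_attention(scores, values):
    """Exact softmax-weighted average of the values for one query."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def topk_plus_sampling_attention(scores, values, k, num_samples, rng):
    """Exact contribution from the k highest-scoring tokens plus a rescaled
    uniform sample (without replacement) of the remaining tokens."""
    if k + num_samples <= 0:
        raise ValueError("need at least one token")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top, rest = order[:k], order[k:]
    num_samples = min(num_samples, len(rest))
    sampled = rng.sample(rest, num_samples) if num_samples > 0 else []
    # Each sampled token stands in for len(rest) / num_samples residual tokens.
    scale = len(rest) / num_samples if num_samples > 0 else 0.0
    m = max(scores)  # shift for numerical stability
    numerator = denominator = 0.0
    for i in top:
        w = math.exp(scores[i] - m)
        numerator += w * values[i]
        denominator += w
    for i in sampled:
        w = scale * math.exp(scores[i] - m)
        numerator += w * values[i]
        denominator += w
    return numerator / denominator


scores = [0.1 * i for i in range(20)]
values = [float(i) for i in range(20)]
exact = full_attention(scores, values)
# Sampling every residual token makes the estimate coincide with full attention.
dense = topk_plus_sampling_attention(scores, values, 4, 16, random.Random(0))
# A sparser estimate: 4 exact tokens plus 4 sampled ones out of 20.
sparse = topk_plus_sampling_attention(scores, values, 4, 4, random.Random(0))
```

When a few scores dominate, the exact top-$k$ part carries almost all of the mass and the sample hardly matters. When scores are nearly uniform, the rescaled sample does most of the work. This matches the trade-off the abstract describes.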