arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26321 2026-05-27 cs.AI

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Maksim Ivanov, Abhijay Rana

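Affiliations: Agentic Labs

AI summary: Proposes the Anchor pipeline, which uses constraint optimization programs to jointly generate instructions, environments, ground-truth solutions, and verifiers, addressing artifact drift in benchmark generation, and builds the ERP-Bench benchmark to evaluate frontier models.

Comments: Accepted to RLEval '26 (Workshop at ACM Conference on AI and Agentic Systems 2026)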

Abstract

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation often suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.
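As a rough illustration of deriving every task artifact from one parametric specification, the sketch below renders an instruction, a certified optimum, and a state-based verifier from a single toy procurement spec. The spec fields, the function names, and the exhaustive-search solver are hypothetical stand-ins chosen for this example; they are not Anchor's pipeline or the ERP-Bench schema.

```python
from itertools import product

# Hypothetical toy spec: buy `demand` units from suppliers at minimum cost.
SPEC = {
    "demand": 10,
    "suppliers": {"A": {"price": 4, "cap": 6}, "B": {"price": 5, "cap": 8}},
}

def instruction(spec):
    """Render the natural-language task from the spec."""
    return (f"Order exactly {spec['demand']} units at minimum total cost "
            f"without exceeding any supplier's capacity.")

def feasible(spec, order):
    """Explicit constraints: demand met exactly, capacities respected."""
    caps_ok = all(0 <= q <= spec["suppliers"][s]["cap"] for s, q in order.items())
    return caps_ok and sum(order.values()) == spec["demand"]

def cost(spec, order):
    """Total purchase cost of an order."""
    return sum(q * spec["suppliers"][s]["price"] for s, q in order.items())

def solve(spec):
    """Exhaustive search stands in for a real constraint solver."""
    names = list(spec["suppliers"])
    ranges = [range(spec["suppliers"][s]["cap"] + 1) for s in names]
    candidates = (dict(zip(names, qs)) for qs in product(*ranges))
    return min((o for o in candidates if feasible(spec, o)),
               key=lambda o: cost(spec, o))

def verify(spec, final_state):
    """State-based check: inspects only the end state, not the agent's steps."""
    ok = feasible(spec, final_state)
    optimal = ok and cost(spec, final_state) == cost(spec, solve(spec))
    return {"constraints_satisfied": ok, "optimal": optimal}
```

Because the instruction, the reference solution, and the verifier all read the same `SPEC`, changing a parameter changes all three together. The verifier also separates "constraints satisfied" from "optimal", the same distinction the abstract draws between the 26.1% and 17.4% figures.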

2605.26320 2026-05-27 cs.LG cs.CL

MULTISEISMO: A Multimodal Seismic Dataset and Model for Cross-Modal Seismic Understanding

Sai Munikoti, Ian Stewart, Chengping Chai, Lisa Linville, Scott Vasquez, Sameera Horawalavithana, Karl Pazdernik

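Affiliations: Pacific Northwest National Laboratory; Oak Ridge National Laboratory; Sandia National Laboratory; North Carolina State University

AI summary: To address the lack of multimodal data integration in seismology, the authors build MultiSeismo, a structured multimodal dataset of over 16K seismic events, and develop SeisModal, a dedicated multimodal model that achieves superior performance on cross-modal seismic reasoning tasks.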

Abstract

The application of generalist multimodal models (GMMs) to specialized scientific domains remains limited due to the scarcity of comprehensive domain-specific datasets that integrate multiple data modalities beyond text and images. In seismology, understanding earthquake phenomena requires the synthesis of time-series waveform data, geographical imagery, and contextual metadata, a multimodal integration absent in existing seismic datasets. We present MultiSeismo, a large-scale, structured multimodal seismic dataset comprising over 16K seismic events spanning 13 years (2010 to 2023) across diverse geographical regions. Each event record integrates waveform recordings from global station networks, intensity maps, population exposure visualizations, and a comprehensive textual description within a standardized JSON format. We additionally develop MISCE, a multimodal instruction set on top of the raw data to enable supervised training and evaluation of GMMs on seismic reasoning tasks ranging from basic information retrieval to complex cross-modal analysis. We leverage MISCE to fine-tune an existing multimodal model (Unified IO 2) enhanced with a specialized time-series encoder, which yields SeisModal, the first domain-specific multimodal model for comprehensive seismic analysis. Evaluation of state-of-the-art multimodal models on MultiSeismo reveals significant challenges, particularly with time-series data processing for general-purpose models, while demonstrating SeisModal's superior performance on seismic multimodal reasoning tasks. These results show that MultiSeismo provides a rigorous benchmark for future multimodal research in seismology and validate the success of our domain-specific architectural adaptations.

2605.26316 2026-05-27 cs.CV cs.AI

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

Qiao Gu, Lingni Ma, Adam W Harley, Richard Newcombe, Florian Shkurti, Julian Straub

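Affiliations: Meta Reality Labs; University of Toronto

AI summary: Proposes E$^3$C, a controllable video diffusion framework that uses a 3D point-cloud memory and two-channel human control (ego and exo) to generate physically consistent egocentric video.

Comments: Preprint. Project Page: https://e3c-videogen.github.io/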

Abstract

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

2605.26315 2026-05-27 cs.LG cs.AI

Curriculum Learning for Safety Alignment

Sandeep Kumar, Virginia Smith, Chhavi Yadav

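Affiliations: Carnegie Mellon University; Simons Institute, UC Berkeley

AI summary: Proposes Staged-Competence, a curriculum-learning framework that uses difficulty-graded preference data and progressive reference-model updates to make DPO safety alignment more robust, reducing OOD harmful response rates by 16% and jailbreak attack success rates by 20% on average across three model families.

Comments: Accepted at the ICML 2026 GlobalSouthML Workshop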

Abstract

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
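To make "competence-based sampling" over difficulty-ordered preference data concrete, here is a minimal sketch. The abstract does not state the schedule, so the square-root competence curve, the `difficulty` field, and all function names below are assumptions taken from generic competence-based curriculum learning, not the Staged-Competence implementation. The progressive reference-model update is not shown.

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Assumed square-root schedule rising from c0 to 1.0 over training."""
    frac = min(step / total_steps, 1.0)
    return min(1.0, math.sqrt(frac * (1 - c0 ** 2) + c0 ** 2))

def eligible(pairs, step, total_steps):
    """Keep the easiest fraction of preference pairs allowed by current competence."""
    ranked = sorted(pairs, key=lambda p: p["difficulty"])
    cutoff = max(1, math.ceil(competence(step, total_steps) * len(ranked)))
    return ranked[:cutoff]

def sample_batch(pairs, step, total_steps, batch_size, rng=random):
    """Sample uniformly (with replacement) from the currently eligible pairs."""
    pool = eligible(pairs, step, total_steps)
    return [rng.choice(pool) for _ in range(batch_size)]
```

Early in training only the easiest pairs are drawn; by the final step the whole dataset is eligible.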

2605.26302 2026-05-27 cs.AI cs.CL cs.MA

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

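Affiliations: The University of Texas at Austin

AI summary: Proposes the AgingBench benchmark, which uses four aging mechanisms and diagnostic tools to evaluate reliability degradation in deployed agents, and argues for lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair.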

Abstract

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

2605.26295 2026-05-27 cs.CV

Sleep-stage efficient classification using a lightweight self-supervised model

Eldiane Borges dos Santos Durães, João Batista Florindo

Affiliations: Institute of Mathematics, Statistics and Scientific Computing, University of Campinas, Street Sergio Buarque de Holanda, 651, Campinas, Brazil

AI summary: Simplifies the mulEEG self-supervised model and pairs it with a linear SVM classifier for efficient and accurate sleep-stage classification.

Journal ref: Proceedings VISAPP 2025, 972-979 (2025)

Abstract

Accurate classification of sleep stages is crucial for diagnosing sleep disorders, and automating this process can significantly enhance clinical assessments. This study aims to explore the use of a self-supervised model (more specifically, an adapted version of mulEEG) combined with a Linear SVM classifier to improve sleep stage classification. Methods: The mulEEG model, which learns electroencephalogram signal representations in a self-supervised manner, was simplified here by replacing the ResNet-50 with 1D convolutions, used as the time-series encoder, with a ResNet-18 backbone. Two other adaptations were conducted: the first evaluated different configurations of the model and data volume for training, while the second tested the effectiveness of time-series features, spectrogram features, and their concatenation as inputs to a Linear SVM classifier. Results: The results showed that reducing the volume of data offered a better cost-benefit ratio compared to simplifying the model. Using the concatenated features with ResNet-18 also outperformed the linear evaluations of the original mulEEG model, achieving higher classification performance. Conclusions: Simplifying the mulEEG model to extract features and pairing it with a robust classifier leads to more efficient and accurate sleep stage classification. This approach holds promise for improving clinical sleep assessments and can be extended to other biological signal classification tasks.

2605.26294 2026-05-27 cs.CV

CNNs, Transformers, Hybrid, and Vision Language Models for Skin Cancer Detection

Durjoy Dey, Yuhong Yan, Hassan Hajjdiab

Affiliations: Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada; Ebovir Biotechnologie Inc., Montreal, Canada

AI summary: Presents a unified evaluation of 12 deep learning models (CNNs, ViTs, hybrid convolution-transformer models, and vision-language models) on the PAD-UFES-20 dataset; hybrid models and a SigLIP-based VLM strike the best balance between ranking performance and clinically relevant operating points.

Comments: 13 pages, 3 figures, accepted at ICPRAI 2026, The Fifth International Conference on Pattern Recognition and Artificial Intelligence. To appear in Lecture Notes in Computer Science

Abstract

Skin cancer is a common and fast-rising malignancy worldwide. Early detection is critical for improving outcomes. Deep learning models trained on dermoscopic and clinical images can support automated and fast triage. However, many studies evaluate only a limited set of architectures. Experimental setups also vary across studies. In this paper, we present a unified evaluation of twelve deep learning models for binary skin cancer detection on the PAD-UFES-20 dataset. The models span four families: convolutional neural networks (CNN), vision transformers (ViT), hybrid convolution-transformer backbones, and vision-language models (VLM). Performance is assessed using AUC, the maximum F1 score with its precision and recall, and sensitivity at 80% specificity, reflecting screening-oriented requirements. Our results show that well-tuned CNNs already provide strong baselines, but transformer-based families consistently improve discrimination. Hybrid models (MaxViT Tiny, CoAtNet0) and a SigLIP-based VLM achieve the best overall trade-off between ranking performance and clinically relevant operating points, while the CLIP-based model offers high precision. The full codebase for all experiments is publicly released. Together, these findings offer practical guidance on which model families are most suitable for real-world deployment in skin cancer screening and establish a reproducible reference point for future work on PAD-UFES-20.
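The "sensitivity at 80% specificity" operating point can be computed from raw scores as sketched below. The abstract does not say how the threshold is chosen, so this follows one common convention (the best sensitivity among thresholds whose specificity is at least the target). The function name and the toy data are illustrative assumptions.

```python
def sensitivity_at_specificity(labels, scores, target_spec=0.80):
    """Best sensitivity over thresholds whose specificity is at least target_spec.

    labels: 1 = malignant, 0 = benign; scores: higher means more suspicious.
    A sample is predicted positive when score >= threshold.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    best = 0.0
    # float("inf") covers the operating point where nothing is predicted positive.
    for thr in sorted(set(scores)) + [float("inf")]:
        specificity = sum(s < thr for s in negatives) / len(negatives)
        if specificity >= target_spec:
            sensitivity = sum(s >= thr for s in positives) / len(positives)
            best = max(best, sensitivity)
    return best

labels = [0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.35, 0.6, 0.8, 0.9]
print(sensitivity_at_specificity(labels, scores))  # 0.75
```

In the toy data, a threshold of 0.6 keeps four of five benign cases negative (80% specificity) while catching three of four malignant cases.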

2605.26293 2026-05-27 cs.CL cs.AI

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

Mike Zhang, Ali Basirat, Desmond Elliott

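Affiliations: Department of Computer Science (DIKU), University of Copenhagen; Centre for Language Technology (CST), University of Copenhagen; Pioneer Centre for Artificial Intelligence

AI summary: Proposes CroCo, which uses a reward model trained on English preferences to perform contrastive preference tuning on multilingual self-generations without language-specific preference annotation, improving model performance across 14 high- and low-resource languages while avoiding catastrophic forgetting.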

Abstract

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.

2605.26289 2026-05-27 cs.LG

Stateful Inference for Low-Latency Multi-Agent Tool Calling

Victor Norgren

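Affiliations: LayerScale, Inc.

AI summary: Proposes a stateful inference architecture that uses a persistent KV cache and incremental processing to cut the per-turn cost of multi-agent tool calling from $O(n_t)$ to $O(\Delta_t)$, running 2.1x faster per turn on a 6-turn workflow and 4.2x faster on the median turn of a 35-turn workflow.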

Abstract

Multi-agent tool calling is becoming the dominant interaction pattern for LLM-based systems, yet existing inference frameworks treat each tool call as an independent request, re-processing the entire conversation from scratch even though 85-95% of the prompt is unchanged from the previous turn. We present a stateful inference architecture that converts the $O(n_t)$ per-turn cost of conventional serving into an $O(\Delta_t)$ delta-only cost: a persistent KV cache lives across turns and advances by ingesting only the new tokens, while a radix prefix cache extends this across interleaved multi-agent traffic and a prompt-lookup speculative decoder accelerates structured output. Against vLLM and SGLang on novel, fully-generated workloads, the reference implementation is $2.1\times$ faster per turn on a 6-turn agentic workflow and $4.2\times$ on the median turn of a 35-turn one, halving end-to-end wall time. The advantage comes from stateful reuse and speculation, not caching.
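The gap between the $O(n_t)$ and $O(\Delta_t)$ per-turn costs can be seen in a toy token count. This is a simplified cost model written for illustration, with hypothetical names and turn sizes. It counts only prompt tokens ingested, ignores decoding and speculation, and is neither the paper's implementation nor a measurement of any serving framework.

```python
def tokens_processed(turn_deltas, stateful):
    """Total prompt tokens ingested across turns.

    turn_deltas[t] is the number of new tokens added at turn t.
    Stateless serving re-reads the whole conversation (n_t tokens) each turn;
    stateful serving keeps the KV cache and ingests only the delta.
    """
    total, context = 0, 0
    for delta in turn_deltas:
        context += delta  # n_t: conversation length after this turn's input
        total += delta if stateful else context
    return total

# Hypothetical 6-turn workflow: a 1000-token first prompt, then 100 new tokens per turn.
deltas = [1000, 100, 100, 100, 100, 100]
print(tokens_processed(deltas, stateful=False))  # 7500
print(tokens_processed(deltas, stateful=True))   # 1500
```

On the last turn of this toy workload, 1400 of the 1500 prompt tokens (about 93%) are unchanged from the previous turn, which falls in the 85-95% range quoted in the abstract.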

2605.26287 2026-05-27 cs.CV

A multifractal-based masked auto-encoder: an application to medical images

Joao Batista Florindo, Viviane de Moura

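Affiliations: Institute of Mathematics, Statistics and Scientific Computing - University of Campinas

AI summary: Proposes MO-MAE, a masked autoencoder whose masking strategy is optimized with a multifractal measure (Renyi entropy), improving medical image classification by focusing on high-complexity regions.

Journal ref: Proceedings VISAPP 2025, 769-776 (2025)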

Abstract

Masked autoencoders (MAE) have shown great promise in medical image classification. However, the random masking strategy employed by traditional MAEs may overlook critical areas in medical images, where even subtle changes can indicate disease. To address this limitation, we propose a novel approach that utilizes a multifractal measure (Renyi entropy) to optimize the masking strategy. Our method, termed Multifractal-Optimized Masked Autoencoder (MO-MAE), employs a multifractal analysis to identify regions of high complexity and information content. By focusing the masking process on these areas, MO-MAE ensures that the model learns to reconstruct the most diagnostically relevant features. This approach is particularly beneficial for medical imaging, where fine-grained inspection of tissue structures is crucial for accurate diagnosis. We evaluate MO-MAE on several medical datasets covering various diseases, including MedMNIST and COVID-CT. Our results demonstrate that MO-MAE achieves promising performance, surpassing other baseline and state-of-the-art models. The proposed method also adds minimal computational overhead as the computation of the proposed measure is straightforward. Our findings suggest that the multifractal-optimized masking strategy enhances the model's ability to capture and reconstruct complex tissue structures, leading to more accurate and efficient medical image representation. The proposed MO-MAE framework offers a promising direction for improving the accuracy and efficiency of deep learning models in medical image analysis, potentially advancing the field of computer-aided diagnosis.
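One way to picture entropy-guided masking is to score each patch by the Renyi entropy of its intensity histogram and mask the highest-scoring patches. The sketch below does this in plain Python. The per-patch histogram, the order `alpha=2`, the deterministic top-k selection, and the function names are all assumptions made for illustration; the abstract names the measure but does not describe how MO-MAE computes or applies it.

```python
import math
from collections import Counter

def renyi_entropy(values, alpha=2.0):
    """Renyi entropy of order alpha (alpha != 1) of the empirical value histogram."""
    counts = Counter(values)
    n = len(values)
    return math.log(sum((c / n) ** alpha for c in counts.values())) / (1.0 - alpha)

def select_masked_patches(patches, mask_ratio=0.75, alpha=2.0):
    """Return indices of the highest-entropy patches, which would be masked."""
    k = round(mask_ratio * len(patches))
    ranked = sorted(range(len(patches)),
                    key=lambda i: renyi_entropy(patches[i], alpha), reverse=True)
    return sorted(ranked[:k])

# Four flattened toy patches: two flat, one two-valued, one fully varied.
patches = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 0, 1, 1], [5, 5, 5, 5]]
print(select_masked_patches(patches, mask_ratio=0.5))  # [1, 2]
```

Flat patches have zero entropy and are left visible, so reconstruction effort concentrates on the textured regions.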

2605.26285 2026-05-27 cs.LG cs.NA math.NA

Two-Parameter Flows for Learning Population Dynamics of Physical Systems

Paul Schwerdtner, Tobias Blickhan, Benjamin Peherstorfer

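Affiliations: Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, NY 10012, USA

AI summary: Proposes two-parameter flows, which learn the dynamics of high-dimensional probability densities via sampling-time transports from a base distribution to each marginal and extract a physics-time velocity by regressing on coupled synthetic trajectories; the method needs no trajectory information and can handle non-gradient dynamics such as rotation.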

Abstract

This work addresses the problem of learning the dynamics of high-dimensional probability densities over time using unlabeled samples, without assuming access to trajectory information. We introduce two-parameter flows that learn only sampling-time transports from a base distribution to each marginal and then extract a physics-time velocity by regressing on coupled synthetic trajectories. We prove that the resulting physics-time dynamics are unique and inherit regularity from the sampling-time transports. Because we can build on standard, well-developed conditional flow matching techniques for learning the base-to-marginal transports, our approach scales to high dimensions and avoids per-step optimal-transport couplings, while allowing admissible non-gradient dynamics that can naturally explain rotational or circulating physics phenomena.

2605.26284 2026-05-27 cs.RO

PhyPush: One Push is All You Need for Sensorless Physical Property Estimation with Physics-Guided Transformers

Koyo Fujii, Luis Figueredo, Praminda Caleb-Solly, Ivan Boschi, Edoardo Ida', Marco Carricato, Aly Magassouba

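Affiliations: School of Computer Science, University of Nottingham; Dept. of Industrial Engineering, University of Bologna

AI summary: Proposes PhyPush, a physics-guided Transformer framework that estimates object mass and friction coefficient from end-effector velocity during a single push, with constraints from Newton's second law and the Coulomb friction model improving physical consistency and generalization.

Comments: Submitted to 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems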

Abstract

Accurately estimating object mass and friction is fundamental to achieving reliable and adaptive robotic manipulation. Although interactive perception provides a powerful mechanism for inferring such properties, most existing approaches depend on specialized hardware such as force/torque sensors, tactile arrays, or multi-camera motion-capture systems, limiting scalability and deployment. This paper presents PhyPush, a physics-guided Transformer framework that estimates an object's mass and friction coefficient using only kinematically derived end-effector velocity from a single push. This requires only data that is typically available on standard robotic arms. The model incorporates constraints from Newton's second law and the Coulomb friction model through a physics-guided loss, improving physical consistency and generalization to unseen objects and surfaces. Across diverse simulation and real-world setups, PhyPush consistently achieves more accurate mass and friction estimation in challenging out-of-domain conditions. In simulation, it reduces error by over 10% compared with a baseline that has privileged access to full force information, while in real-world experiments, it outperforms a data-driven loss approach. Overall, the results demonstrate that physics-guided learning can enable low-cost, sensor-efficient estimation of physical properties, relying solely on a single push and readily available kinematic data.
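A physics-guided loss of the kind the abstract mentions could combine a data term with a penalty on violating $m a = F - \mu m g$, which is Newton's second law with Coulomb friction on a flat surface. The sketch below is one possible form only. The loss weighting, the function names, and in particular the availability of a push force at training time (for example from a simulator) are assumptions; the abstract does not specify how PhyPush builds its loss.

```python
G = 9.81  # gravitational acceleration, m/s^2

def physics_residual(mass, mu, accel, push_force):
    """Residual of m*a = F - mu*m*g for a block pushed on a flat surface."""
    return mass * accel - (push_force - mu * mass * G)

def physics_guided_loss(pred, target, accels, forces, weight=0.1):
    """Squared error on (mass, mu) plus a weighted mean squared physics residual."""
    (m_hat, mu_hat), (m, mu) = pred, target
    data_term = (m_hat - m) ** 2 + (mu_hat - mu) ** 2
    residuals = [physics_residual(m_hat, mu_hat, a, f)
                 for a, f in zip(accels, forces)]
    physics_term = sum(r * r for r in residuals) / len(residuals)
    return data_term + weight * physics_term
```

With predictions equal to the true mass and friction coefficient and accelerations consistent with the model, both terms vanish; a wrong mass is penalized by the data term and again by the physics residual.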

2605.26283 2026-05-27 cs.CV cs.LG

Benchmarking Convolutional, Transformer, Hybrid, and Vision Language Models for Multi Disease Retinal Screening

Durjoy Dey, Aymane Ajbar, Yuhong Yan

Affiliations: Department of Computer Science and Software Engineering, Concordia University; Ebovir Biotechnologie Inc.

AI summary: Benchmarks 12 architectures from four model families on the RFMiD dataset for multi-disease retinal screening; attention-based models (such as SwinTiny, CoAtNet0, and MaxViTTiny) perform best on binary screening and multi-label classification, while vision-language models match CNN baselines but do not surpass the best transformer and hybrid models.

Comments: 12 pages, 3 figures, accepted at ICMHI 2026, 10th International Conference on Medical and Health Informatics, Kyoto, Japan. To appear in ACM Conference Proceedings

Abstract

Modern deep learning offers powerful tools for automated retinal screening, but it remains unclear how different visual model families compare in realistic multi-disease settings and under domain shift. In this work, we benchmark twelve architectures across four model families: convolutional neural networks, vision transformers, hybrid CNN-transformer backbones, and vision-language models, using the Retinal Fundus Multi-disease Image Dataset (RFMiD). We evaluate two tasks: binary screening for any retinal disease and multi-label classification across 28 disease classes. Using standardized training, calibration, and evaluation protocols, we report AUC, F1, precision, recall, and sensitivity at a clinically relevant operating point with specificity near 80%. On RFMiD, all architectures perform well on binary screening, with AUC above 84%, but attention-based models perform best. SwinTiny and the hybrid CoAtNet0 and MaxViTTiny models achieve the strongest binary screening results and improve macro and micro F1 in the multi-label setting. Vision-language models, including CLIP ViT-B/16 and SigLIP-Base384, are competitive with CNN baselines but do not surpass the best transformer and hybrid backbones. In external validation on Messidor-2 for referable diabetic retinopathy, AUC ranges from 66.8% to 84.7%, with hybrid and transformer models again showing strong performance. These results provide a reproducible reference for model selection in multi-disease retinal screening and guide future automated screening tools for clinical deployment.

2605.26282 2026-05-27 cs.LG

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

Xiaoyuan Cheng, Wenxuan Yuan, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, Che Liu

Affiliations: Dynamic Systems Lab, University College London; College of Computing and Data Science, Nanyang Technological University; School of Intelligence Science and Technology, Peking University; Santa Fe Institute; School of Statistics and Data Science, Shanghai University of Finance and Economics; Department of Computing, Imperial College London

AI summary: To address the structural misalignment between search and value learning in world-model reinforcement learning, proposes MBDPO, a model-based method built on diffusion policy optimization that unifies search and policy optimization for scalable policy learning.

英文摘要

Model-based reinforcement learning (RL) can be effectively supported at scale through the use of world models. However, in practice, scaling such approaches remains fundamentally limited. A commonly recognized challenge is model bias and error compounding, which degrade long-horizon predictions. Beyond these issues, we identify a more critical yet underexplored bottleneck: a structural misalignment between search and value learning in existing world model approaches. In particular, policy improvement often relies on value functions induced by a separate, non-search policy, resulting in training inconsistency and ultimately suboptimal learning. To address this limitation, we propose Model-Based Diffusion Policy Optimization (MBDPO) in world models, a framework that unifies search and policy optimization through diffusion policy representations, thereby unlocking the potential of world models for scalable policy learning. Instead of constructing an explicit planner over a learned world model, we reformulate policy optimization as a diffusion process over searched trajectories in latent world models. In this view, we extract an implicit energy function from the collected dataset that anchors the policy, enabling MBDPO to refine the score field for policy optimization while mitigating misalignment. We evaluate MBDPO across a wide range of settings, including multi-task offline pretraining, online learning, and offline-to-online fine-tuning. In the offline regime, we further investigate its scaling behavior by pretraining on large-scale datasets, observing consistent and monotonic performance gains with increasing model capacity.

2605.26275 2026-05-27 cs.CL

SPEAR: Code-Augmented Agentic Prompt Optimization

SPEAR: 代码增强的智能体提示优化

Mengyin Lu, Cong Feng, Huimin Han, Guangming Lu, Yu Sun, Xiaonan Ding, Shihui Long, Fengyi Li, Tanvi Motwani

发表机构 * LinkedIn Corporation(领英公司)

AI总结 提出SPEAR方法,将代码执行作为智能体工具进行提示优化,通过Python沙箱实现结构错误分析,并在工业任务和基准测试中取得显著提升。

Comments 19 pages, 3 figures, EMNLP 2026 submission

AI中文摘要

自动提示工程(APE)重写提示以改进下游任务性能,但现有的APE循环将优化器本身视为固定流水线。我们将CodeAct(Wang等人,2024a)的代码即行动范式移植到APE,并提出SPEAR(沙盒化主动回滚提示工程师),一个具有四个工具(评估、python、set_prompt、finish)的自由形式智能体优化器,自主决定如何使用这些工具。独特的工具是Python沙箱:优化器在当前评估DataFrame上编写并执行任意Python代码,进行智能体自身编写的结构错误分析(混淆矩阵、错误聚类、每组指标)。两个护栏将长时程智能体转变为单调改进的优化器:指标回归时的自动回滚,以及可选的防护指标下限。我们在三个工业级LLM作为评判者的套件(涵盖招聘人员面试、对话记忆和查询改进系统的13个评判任务)以及七个BBH任务和GSM8K上进行评估。SPEAR在每个工业任务的主要指标上获胜(工具选择上κ 0.857 vs 0.359;过滤相关性上F1-macro 0.815 vs 0.763;最难提取维度上κ 0.254 vs 0.218)。在BBH-7上,SPEAR平均准确率0.938,而GEPA为0.628,TextGrad为0.484。消融实验表明,Python工具是复杂评判任务上最大的单一杠杆(在5类工具选择评判任务上Δ≈+0.79κ,在移除时最难提取维度上Δ≈+0.35κ);其不可替代的贡献是类对混淆聚合,而长上下文LLM无法从原始评估DataFrame中可靠提取该信息。

英文摘要

Automatic prompt engineering (APE) rewrites prompts to improve downstream task performance, but existing APE loops treat the optimizer itself as a fixed pipeline. We port the code-as-action paradigm of CodeAct (Wang et al., 2024a) to APE and propose SPEAR (Sandboxed Prompt Engineer with Active Roll-back), a free-form agentic optimizer with four tools -- evaluate, python, set_prompt, finish -- that decides autonomously how and when to use them. The distinctive tool is the Python sandbox: the optimizer writes and executes arbitrary Python on the current evaluation DataFrame, performing structural error analysis (confusion matrices, error clustering, per group metrics) the agent itself authors. Two guardrails turn the long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor. We evaluate on three industrial LLM-as-judge suites (13 judge tasks across recruiter-intake, conversational-memory, and query-refinement systems) plus seven BBH tasks and GSM8K. SPEAR wins every industrial task on the primary metric ($κ$ 0.857 vs 0.359 on tool-selection; F1-macro 0.815 vs 0.763 on filter-relevance; $κ$ 0.254 vs 0.218 on the hardest extraction dimension). On BBH-7 SPEAR averages 0.938 accuracy vs GEPA 0.628 and TextGrad 0.484. Ablations show the Python tool is the largest single lever on complex judge tasks ($Δ\approx +0.79κ$ on the 5-class tool-selection judge, $Δ\approx +0.35κ$ on the hardest extraction dimension when removed); its irreplaceable contribution is class-pair confusion aggregation that a long-context LLM cannot extract reliably from the raw eval DataFrame.
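The two guardrails described above (auto-rollback on metric regression and an optional guard-metric floor) can be sketched as a simple accept-or-revert loop. This is a minimal illustration under stated assumptions: `optimize_with_rollback`, `propose`, and `evaluate` are hypothetical names, the candidate prompts and scores are made up, and the real SPEAR agent chooses its tool calls freely rather than following a fixed loop.

```python
def optimize_with_rollback(initial_prompt, propose, evaluate, steps=5, guard_floor=None):
    """Keep a candidate prompt only if the primary metric improves and the
    optional guard metric stays at or above guard_floor.

    propose(prompt) -> new prompt; evaluate(prompt) -> (primary, guard).
    Both callables are hypothetical stand-ins for the agent's tools.
    """
    best_prompt = initial_prompt
    best_primary, _ = evaluate(best_prompt)
    history = [best_primary]
    for _ in range(steps):
        candidate = propose(best_prompt)
        primary, guard = evaluate(candidate)
        guard_ok = guard_floor is None or guard >= guard_floor
        if primary > best_primary and guard_ok:
            best_prompt, best_primary = candidate, primary
        # Otherwise roll back: best_prompt is left unchanged.
        history.append(best_primary)
    return best_prompt, history


# Toy run with invented scores: (primary metric, guard metric) per prompt.
score_table = {"base": (0.5, 0.9), "a": (0.4, 0.9), "b": (0.7, 0.2), "c": (0.6, 0.8)}
candidates = iter(["a", "b", "c"])
best, trace = optimize_with_rollback(
    "base", lambda prompt: next(candidates), lambda prompt: score_table[prompt],
    steps=3, guard_floor=0.5,
)
```

In this toy run, candidate "a" regresses and is rolled back, "b" improves the primary metric but violates the guard floor, and only "c" is accepted, so the recorded best score never decreases.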

2605.26273 2026-05-27 cs.CV

Frequency-Guided Fusion For RGB-Thermal Semantic Segmentation

频率引导的RGB-热红外语义分割融合

İsmail Emre Canıtez, Özgür Erkent

发表机构 * Hacettepe University(哈切特佩大学)

AI总结 提出一种基于双ConvNeXt V2骨干网络的多模态融合架构,通过频率分解和置信门控残差机制融合RGB与热红外特征,在MFNet和PST900上以较低参数量实现先进性能。

Comments 9 pages, 7 figures, To be Presented at Perception Beyond the Visible Spectrum workshop series (IEEE PBVS) at CVPR, 2026

AI中文摘要

在城市驾驶场景等复杂环境中,语义分割在光照条件不佳时仍具挑战性,仅凭RGB图像提供的信息不足。RGB-热红外融合利用可见光和红外图像的互补优势来提升场景理解;然而,在不同特征抽象层次上有效整合这些异质模态仍是一个开放问题。本文提出一种基于双ConvNeXt V2骨干网络的多模态融合架构,采用分阶段、模态自适应的融合策略。对于早期特征,我们引入基于频率的融合模块,通过高斯滤波将红外特征分解为低频和高频分量,应用双分支空间注意力选择性强调热模式与精细边界,并通过置信门控残差机制将其与RGB特征融合。对于后期特征,我们设计了一个具有跨模态注意力和多尺度深度可分离卷积的语义融合模块,以捕捉模态间的语义对应关系。融合后的特征通过带有深度监督的PANet风格双向解码器进行解码。在MFNet和PST900上的实验表明,我们最轻量化的变体分别达到61.73%和86.24%的mIoU,仅需35.43M参数,在显著减少参数和计算成本的同时优于近期方法。代码可在https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION获取。

英文摘要

Semantic segmentation in complex environments such as urban driving scenes remains challenging under adverse lighting conditions, where RGB images alone provide insufficient information. RGB-Thermal fusion leverages the complementary strengths of visible and infrared imagery to improve scene understanding; however, effectively integrating these heterogeneous modalities at varying levels of feature abstraction remains an open problem. In this paper, we propose a multi-modal fusion architecture built upon dual ConvNeXt V2 backbones that employs stage-wise, modality-adaptive fusion strategies. For early-stage features, we introduce a Frequency-Based Fusion Module that decomposes infrared features into low- and high-frequency components via Gaussian filtering, applies dual-branch spatial attention to selectively emphasize thermal patterns and fine-grained boundaries, and integrates them with RGB features through a confidence-gated residual mechanism. For late-stage features, we design a semantic fusion module with cross-modal attention and multi-scale depthwise convolutions to capture semantic correspondences across modalities. The fused features are decoded via a PANet-style bidirectional decoder with deep supervision. Experiments on MFNet and PST900 demonstrate that our lightest variant achieves 61.73% and 86.24% mIoU, respectively, with only 35.43M parameters, outperforming recent methods while using substantially fewer parameters and lower computational cost. Code is available at https://github.com/ismailemrecntz/VISIBLE-INFRARED-SENSOR-FUSION
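The frequency decomposition step described above (Gaussian filtering to obtain a low-frequency component, with the residual as the high-frequency component) can be sketched on a 1D signal. This is a toy illustration under assumptions: the function names, the kernel radius, and the edge-replication padding are hypothetical, and the paper applies the idea to 2D infrared feature maps rather than to a Python list.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def split_frequencies(signal, sigma=1.0, radius=2):
    """Return (low, high), where low is the Gaussian-smoothed signal and
    high = signal - low, so that low + high reconstructs the input."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    low = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # replicate edge samples
            acc += w * signal[j]
        low.append(acc)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


# A step edge: the high-frequency part should peak next to the step.
step_signal = [1.0] * 4 + [5.0] * 4
low, high = split_frequencies(step_signal)
```

The smoothed branch keeps slowly varying structure (for example broad warm regions), while the residual concentrates around sharp transitions, which is why the two branches can be weighted separately before fusion.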

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力:视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

发表机构 * Technical University of Munich(慕尼黑工业大学) Tensordyne

AI总结 针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题,提出基于Jensen偏差的在线逐注意力分数校正方法,在INT2量化下恢复接近BF16的视频质量,且内存减半。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算,但随着视频变长,该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力,但降低了视频质量。我们表明,这种降低的一个关键驱动因素是注意力权重的系统性偏差:由于softmax注意力中指数的凸性,量化噪声膨胀了缓存键的贡献,我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正,在期望中消除此偏差,该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似,额外的计算开销可忽略不计,且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化,我们的校正恢复了因激进量化而损失的大部分质量,达到接近BF16的视频质量,并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
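The Jensen bias described above can be reproduced numerically with a small Monte Carlo sketch. Everything below is an assumption-laden illustration rather than the paper's formula: rounding noise is modeled as independent uniform noise of width `step` per key coordinate, scores use the usual 1/sqrt(d) scaling, and the correction subtracts log(1 + sigma^2 / 2) from the logit, where sigma^2 = step^2 * ||q||^2 / (12 d) is the variance of the score noise (a second-order Taylor estimate of the inflation factor).

```python
import math
import random


def jensen_bias_demo(query, key, step, trials=20000, seed=0):
    """Return (clean, noisy_mean, corrected_mean) for the unnormalized
    attention weight exp(q . k / sqrt(d)) under simulated key quantization noise."""
    rng = random.Random(seed)
    d = len(query)
    scale = 1.0 / math.sqrt(d)
    clean = math.exp(scale * sum(q * k for q, k in zip(query, key)))
    # Variance of the score noise under the uniform rounding-noise assumption.
    sigma2 = (step ** 2) * sum(q * q for q in query) / (12.0 * d)
    # Second-order Taylor estimate: E[exp(noise)] is roughly 1 + sigma2 / 2.
    correction = math.log1p(sigma2 / 2.0)
    total = 0.0
    for _ in range(trials):
        noisy_key = [k + rng.uniform(-step / 2, step / 2) for k in key]
        total += math.exp(scale * sum(q * k for q, k in zip(query, noisy_key)))
    noisy_mean = total / trials
    corrected_mean = noisy_mean * math.exp(-correction)
    return clean, noisy_mean, corrected_mean


clean, noisy, corrected = jensen_bias_demo(
    [2.0, -2.0, 2.0, -2.0], [0.5, 0.1, -0.3, 0.2], step=1.0
)
```

Because the exponential is convex, zero-mean noise on the score still raises the expected weight, so the noisy average sits above the clean value; subtracting the variance-based term from the logit pulls it most of the way back. The correction needs only the step size and the query norm, which is consistent with the abstract's claim that no extra memory is required.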

2605.26262 2026-05-27 cs.CV

Dimensional Distribution Emotion State: Leveraging Valence and Arousal as a Common Embedding Space for Visual Emotion Analysis

维度分布情感状态:利用效价和唤醒作为视觉情感分析的通用嵌入空间

Émile Bergeron, Tadagbé Dhossou, Sébastien Tremblay, Jean-François Lalonde

发表机构 * Université Laval(拉瓦尔大学)

AI总结 提出一种新的情感表示方法DDES,结合连续双维情感空间和多数据集训练流程,以辅助博物馆策展人预测艺术品引发的情感反应。

AI中文摘要

博物馆是传播文化艺术的重要场所。它们是植根于历史和传统的机构;其展览通常旨在突出这些方面。最近,该领域正在探索一种新方法:基于情感的展览。这些展览专门设计用于引发游客的情感,以最大化参与度,并作为民主化艺术接触和吸引更广泛、更多样化观众的一种方式。为此,必须首先提取艺术品的情感内容,然而,由专家手动标注艺术品是一个劳动密集且成本高昂的过程,并且存在引入策展人个人偏见的风险。为了协助博物馆策展人设计这些展览,我们希望开发一种能够预测艺术作品所引发的情感反应的工具。在本文中,我们利用连续的双维情感空间来增强情感表示和深度学习模型的训练过程。借鉴现有的分类和维度情感表示,我们引入了一种新的表示方法——维度分布情感状态(DDES),以及一个多数据集训练流程。我们表明,与广泛使用的表示相比,DDES提供了多种优势,同时表现出相似的基线性能。

英文摘要

Museums are important sites for the dissemination of culture and art. They are institutions rooted in history and tradition, and their exhibitions are often designed to highlight these aspects. Recently, a new approach has been explored in the field: emotion-based exhibitions. These exhibitions are designed specifically to elicit emotions in visitors, both to maximize engagement and to democratize access to art and attract a wider, more diverse audience. To do so, the emotional content of the artworks must first be extracted. However, manual annotation of artworks by experts is prohibitively labor-intensive and risks introducing the personal bias of curators. To assist museum curators in designing these exhibitions, we aim to develop a tool that can predict the emotional response evoked by a work of art. In this article, we leverage a continuous bi-dimensional emotion space to enhance emotion representations and the training process of deep learning models. Drawing inspiration from existing categorical and dimensional emotion representations, we introduce a new representation, Dimensional Distribution Emotion State (DDES), along with a pipeline for multi-dataset training. We show that DDES provides multiple advantages compared to widely used representations while exhibiting similar baseline performance.

2605.26256 2026-05-27 cs.AI

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

个性化具身多模态大语言模型代理在长期用户交互中的应用

Jeongeun Lee, Chanyoung Park, Dongha Lee

发表机构 * Yonsei University(延世大学) KAIST(韩国科学技术院)

AI总结 提出POLAR框架,通过多模态知识图谱记忆机制增强具身代理在长期交互中的个性化能力,显著提升多步推理和用户上下文跟踪性能。

AI中文摘要

基于多模态大语言模型的具身代理在物理环境中解决复杂任务方面展现出强大潜力。然而,个性化辅助不仅需要遵循通用指令或识别物体类别。在现实场景中,目标通常仅通过先前的交互隐式指定,要求代理利用随时间积累的个性化上下文。在这项工作中,我们提出了POLAR,一个用于长期用户交互中个性化具身代理的多模态记忆增强框架。POLAR将先前的交互组织成一个多模态知识图谱,该图谱捕获用于个性化上下文和视觉概念的语义记忆,以及用于代理轨迹等具身经验的情景记忆。为了执行具身任务,POLAR检索相关记忆以解释当前请求并指导任务执行。我们在多个MLLM骨干网络和多样化的评估场景下评估POLAR,以研究记忆在长期个性化中的作用。结果表明,所提出的记忆机制通过更有效地利用先前交互中积累的信息,持续提升性能。当代理需要在多个交互中进行推理、执行多跳推理或随时间跟踪用户特定上下文的更新时,性能提升尤为显著。

英文摘要

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.

2605.26252 2026-05-27 cs.AI cs.DB

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

智能体记忆是数据库吗?重新思考长期AI智能体记忆的数据基础

Abdelghny Orogat, Essam Mansour

发表机构 * Concordia University(康科迪亚大学)

AI总结 本文提出将长期AI智能体记忆视为一种新的数据管理工作负载,通过形式化治理演化记忆(GEM)框架,用四个状态级操作替代记录级操作,并论证记录级系统无法满足其正确性条件,最后通过原型MemState验证可行性并指出未来研究方向。

AI中文摘要

长期运行的AI智能体需要持久记忆。记忆支持跨会话的学习,减少重复的上下文注入,并能够审计过去的决策。当前的智能体记忆系统和数据库范式将记忆视为存储。它们将正确性定位在记录、嵌入或边上。每个只提供了长期记忆所需的部分能力。结果导致四种反复出现的故障模式:无节制的增长、缺乏语义修订、容量驱动的遗忘以及只读检索。在我们的愿景中,长期智能体记忆是一种新的数据管理工作负载。其正确性是状态轨迹的属性,而非单个记录的属性。我们将其形式化为治理演化记忆(GEM)。GEM用四个状态级操作替代记录级数据库操作:摄取、修订、遗忘和检索。六个正确性条件控制状态如何演化。三个结构性观察表明,无论存储模型如何,没有记录级系统能够满足这些条件。我们在一个属性图后端上实现了该抽象的原型MemState。MemState验证了可行性并揭示了与原生引擎之间的差距。我们概述了三个研究方向,将记忆中心的数据管理定义为一个工作负载。

英文摘要

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.

2605.26248 2026-05-27 cs.LG cs.AI cs.NE

Unified Neural Scaling Laws

统一神经缩放定律

Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish

发表机构 * Mila, University of Montreal(蒙特利尔大学Mila实验室) Google DeepMind(谷歌DeepMind)

AI总结 提出一种统一神经缩放定律(UNSL)函数形式,能够准确建模和预测深度神经网络在多个维度(模型参数、训练数据量、训练步数、推理步数、计算量及超参数)同时变化时的缩放行为,适用于多种架构和任务,并在大规模视觉、语言、数学和强化学习任务中实现更精确的缩放行为外推。

AI中文摘要

我们提出了一种函数形式(称为统一神经缩放定律(UNSL)),该形式能够准确建模和预测深度神经网络在多个维度(即评估指标如何随模型参数数量、训练数据集大小、训练步数、推理步数、计算量以及各种超参数同时变化)同时变化时的缩放行为,适用于多种架构以及各种上游和下游任务中的每个任务。这些任务包括大规模视觉、语言、数学和强化学习。与其他神经缩放的函数形式相比,该函数形式在该任务集上产生的缩放行为外推结果显著更准确。

英文摘要

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.

2605.26246 2026-05-27 cs.LG

The Bridge-Garden Dilemma in LLM Distillation: Why Mixing Hard and Soft Labels Works

LLM蒸馏中的桥园困境:为什么混合硬标签和软标签有效

Guanghui Wang, Kaiwen Lv Kacuila, Zhiyong Yang, Zitai Wang, Jin-Wen Wu, Longtao Huang, Qianqian Xu, Qingming Huang

发表机构 * School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学计算机科学与技术学院) Alibaba Group, Hangzhou, China(阿里巴巴集团) State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院人工智能安全国家重点实验室) Beijing Academy of Artificial Intelligence, Beijing, China(北京人工智能研究院) Key Laboratory of Big Data Mining and Knowledge Management (BDKM), University of Chinese Academy of Sciences, Beijing, China(中国科学院大数据挖掘与知识管理重点实验室)

AI总结 针对大语言模型知识蒸馏中硬标签与软标签的混合使用,提出桥园分解理论解释其降低暴露偏差的机制,并开发自适应混合监督方法,在多个模型上实现性能提升和9.7倍训练成本降低。

Comments Accepted at ICML 2026

AI中文摘要

知识蒸馏(KD)将知识从大型教师模型转移到较小的学生模型。在语言建模中,学生模型要么在从教师模型采样的标记(硬标签)上训练,要么在教师模型的完整下一个标记分布(软标签)上训练。尽管软标签看起来严格更丰富,但我们发现混合硬标签和软标签始终能产生更好的结果。关键的是,我们表明这种增益不能通过训练期间更接近教师匹配来解释。相反,它来自于减少暴露偏差,即训练和推理分布之间的不匹配。为了解释这一现象,我们引入了桥园分解理论,该理论将生成步骤分为两类:桥(Bridge),其中下一个标记必须精确;园(Garden),其中下一个标记可以灵活。我们表明,仅硬标签的KD在桥中通过避免风险偏差表现出色,而仅软标签的KD在园中保持多样性。混合策略处理两种情况,从而减少整个序列中的暴露偏差。在该理论的指导下,我们开发了一系列桥园混合监督方法,自适应地平衡硬标签和软标签。在包含七个教师-学生对(包括Qwen、Llama、Gemma和DeepSeek)的主要套件以及推理和编码基准测试中,我们的方法优于基于散度和基于策略的KD基线,同时将训练成本降低了9.7倍,实现了高效的模型压缩。代码可在https://github.com/ghwang-s/bridge_garden_hybrid_kd_release获取。

英文摘要

Knowledge distillation (KD) transfers knowledge from a large teacher model to a smaller student. In language modeling, the student is trained either on tokens sampled from the teacher (hard labels) or on the teacher's full next-token distribution (soft labels). Although soft labels appear strictly richer, we find that mixing hard and soft labels consistently yields better results. Crucially, we show that this gain cannot be explained by closer teacher matching during training. Instead, it comes from reduced exposure bias, the mismatch between training and inference distributions. To explain this phenomenon, we introduce the Bridge-Garden Decomposition theory, which categorizes generation steps into two types: Bridges, where the next token must be exact, and Gardens, where it can be flexible. We show that hard-only KD excels in Bridges by avoiding risky deviations, while soft-only KD preserves diversity in Gardens. A hybrid strategy handles both cases and, as a result, reduces exposure bias across the sequence. Guided by this theory, we develop a family of Bridge-Garden hybrid supervision methods that adaptively balance hard and soft labels. Across a primary suite of seven teacher-student pairs (including Qwen, Llama, Gemma, and DeepSeek) and benchmarks in reasoning and coding, our approach outperforms divergence-based and on-policy KD baselines while reducing training cost by 9.7x, enabling efficient model compression. Code is available at https://github.com/ghwang-s/bridge_garden_hybrid_kd_release.
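The idea of mixing hard and soft labels can be made concrete with a per-token loss sketch. This is a minimal illustration under assumptions: `hybrid_kd_loss` is a hypothetical name, the mixing weight `mix` is fixed here whereas the paper balances the two terms adaptively, and forward KL is used for the soft term purely as one common choice.

```python
import math


def hybrid_kd_loss(student_probs, teacher_probs, sampled_token, mix=0.5):
    """Per-token loss mixing a hard-label and a soft-label term.

    hard: cross-entropy on one token sampled from the teacher.
    soft: KL(teacher || student) over the full next-token distribution.
    mix:  weight on the hard term (fixed here; an assumption, see lead-in).
    """
    hard = -math.log(student_probs[sampled_token])
    soft = sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )
    return mix * hard + (1.0 - mix) * soft


teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.4, 0.2]
loss_hard_only = hybrid_kd_loss(student, teacher, sampled_token=0, mix=1.0)
loss_soft_only = hybrid_kd_loss(student, teacher, sampled_token=0, mix=0.0)
loss_hybrid = hybrid_kd_loss(student, teacher, sampled_token=0, mix=0.5)
```

In the abstract's terms, a larger `mix` would suit Bridge steps, where the next token must be exact, and a smaller `mix` would suit Garden steps, where several continuations are acceptable.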

2605.26244 2026-05-27 cs.CV cs.MM cs.SD

LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV

LongAV-Compass:面向分钟级音视频生成在T2AV、I2AV和V2AV上的统一评估

Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang

发表机构 * Peking University(北京大学) Kling Team(Kling团队) Nanjing University(南京大学) SJTU(上海交通大学) HKUST(GZ)(香港科技大学(广州)) Shanghai AI Lab(上海人工智能实验室) Nanyang Technological University(南洋理工大学) CASIA(中国科学院自动化所) Tsinghua University(清华大学)

AI总结 针对现有评估协议局限于短片段的问题,提出LongAV-Compass基准,通过284个测试用例和统一评估框架,系统评估分钟级音视频生成在文本、图像、视频条件下的质量、一致性和对齐。

AI中文摘要

音视频生成正从短片段快速发展到分钟级内容,而现有评估协议仍主要局限于短片段设置。现有基准主要关注5-10秒的文本条件生成,很少支持跨文本、图像和视频条件模态的统一评估。此外,它们对身份一致性、叙事连贯性和音视频对齐在长时间跨度上的退化提供的洞察有限。为弥补这一差距,我们引入了LongAV-Compass,一个用于分钟级音视频生成的系统基准。LongAV-Compass包含284个精选测试用例,涵盖文本到音视频(T2AV)、图像到音视频(I2AV)和视频到音视频(V2AV),按应用场景和生成复杂度组织。该基准结合了基于分类法的基准构建和统一评估框架,该框架集成了MLLM辅助评估与互补的感知和多模态指标,包括DINO-v2、ArcFace、CLIP和ImageBind。该框架评估超过20个细粒度维度,涵盖片段内质量、跨片段一致性、全局叙事连贯性、语义对齐和音视频同步。通过对11个代表性模型的实验以及人类对齐验证,LongAV-Compass提供了一个诊断测试平台,用于分析当前系统在跨不同输入模态维持连贯、语义对齐和时间一致的分钟级音视频生成方面的局限性。

英文摘要

Audio-visual generation is rapidly advancing from short clips to minute-long content, while existing evaluation protocols remain largely confined to short-form settings. Existing benchmarks primarily focus on 5--10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. The benchmark combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.

2605.26243 2026-05-27 cs.LG

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

可证明通信高效且隐私保护的联邦图神经网络

Zhishuai Guo, Wenhan Wu, Chen Chen, Lei Zhang, Olivera Kotevska, Ravi K Madduri

发表机构 * Northern Illinois University(北伊利诺伊大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校) University of Central Florida(中央佛罗里达大学) Oak Ridge National Laboratory(橡树岭国家实验室) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出CE-FedGNN框架,通过稀疏交换聚合节点表示和移动平均估计器处理跨客户端依赖,结合度量差分隐私实现通信高效与隐私保护,并证明收敛速率和隐私保证。

AI中文摘要

图神经网络(GNN)在关系数据上取得了强性能,但现实世界的图通常分布在多个组织之间,由于隐私和政策约束,这些组织无法共享原始数据。现有的联邦GNN方法要么忽略跨客户端链接导致精度下降,要么需要频繁的嵌入交换,带来巨大的通信和隐私成本。我们提出了CE-FedGNN,一个通信高效且隐私保护的联邦GNN框架,用于学习此类耦合图。我们的方法避免共享原始数据或每轮嵌入,而是通过稀疏交换聚合的节点表示。为了处理跨客户端依赖和过时性,我们引入了一个移动平均估计器,持续跟踪节点表示并使其能够在多轮中稳定重用。为了为发布的表示提供正式的隐私保证,我们采用了度量差分隐私(metric-DP)框架,该框架根据学习嵌入空间中的距离而非最坏情况输入扰动来衡量隐私。这在标准差分隐私变得过于保守的噪声水平下提供了有意义的保证。我们建立了以$O(1/\sqrt{T})$速率收敛到稳定点,通信复杂度为$O(T^{3/4})$。此外,我们在公共队列威胁模型下通过Rényi差分隐私组合推导了$(\varepsilon,\delta)$-度量差分隐私保证。在合成银行间反洗钱基准和引文网络上的实验表明,CE-FedGNN在显著降低通信的同时保持了强性能,并在隐私保护噪声下保持鲁棒性。

英文摘要

Graph neural networks (GNNs) achieve strong performance on relational data, but real-world graphs are often distributed across organizations that cannot share raw data due to privacy and policy constraints. Existing federated GNN methods either ignore cross-client links, leading to degraded accuracy, or require frequent embedding exchanges, incurring substantial communication and privacy costs. We propose CE-FedGNN, a communication-efficient and privacy-preserving federated GNN framework for learning over such coupled graphs. Our approach avoids sharing raw data or per-round embeddings by infrequently exchanging aggregated node representations. To handle cross-client dependency and staleness, we introduce a moving-average estimator that continuously tracks node representations and enables their stable reuse across rounds. To provide formal privacy guarantees for the released representations, we adopt the metric differential privacy (metric-DP) framework, which measures privacy with respect to distances in the learned embedding space rather than worst-case input perturbations. This yields meaningful guarantees at noise levels where standard differential privacy becomes overly conservative. We establish convergence to a stationary point at a rate of $O(1/\sqrt{T})$ with $O(T^{3/4})$ communication complexity. In addition, we derive $(\varepsilon,δ)$-metric-DP guarantees via Rényi differential privacy composition under a public-cohort threat model. Experiments on synthetic interbank anti-money laundering benchmarks and citation networks demonstrate that CE-FedGNN achieves strong performance while significantly reducing communication and maintaining robustness under privacy-preserving noise.
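The combination of a moving-average estimator with infrequent exchange, as described above, can be sketched in a few lines. This is a toy illustration under assumptions: the exponential form of the moving average, the names `update_estimate` and `run_rounds`, and the fixed exchange period are all hypothetical, and the privacy noise is omitted entirely.

```python
def update_estimate(estimate, fresh, beta=0.1):
    """Exponential moving average of a node representation (assumed form)."""
    return [(1.0 - beta) * e + beta * f for e, f in zip(estimate, fresh)]


def run_rounds(embeddings_per_round, beta=0.5, exchange_every=2):
    """Track a node representation locally every round, but share it with
    other clients only every `exchange_every` rounds.

    Returns (local estimate, last shared copy, number of exchanges).
    """
    estimate = list(embeddings_per_round[0])
    shared = list(estimate)  # what other clients currently hold (possibly stale)
    exchanges = 0
    for round_index, fresh in enumerate(embeddings_per_round[1:], start=1):
        estimate = update_estimate(estimate, fresh, beta)
        if round_index % exchange_every == 0:
            shared = list(estimate)
            exchanges += 1
    return estimate, shared, exchanges


# One-dimensional toy embedding that jumps from 0 to 1 after the first round.
final_estimate, last_shared, num_exchanges = run_rounds([[0.0], [1.0], [1.0], [1.0]])
```

The smoothed estimate changes gradually, so a copy that is one or two rounds stale stays close to the current value, which is the intuition behind reusing representations across rounds instead of exchanging embeddings every round.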

2605.26242 2026-05-27 cs.AI

Can LLMs Introspect? A Reality Check

LLM 能否内省?一个现实检验

Shashwat Singh, Tal Linzen, Shauli Ravfogel

发表机构 * Center for Data Science(数据科学中心)

AI总结 本文基于人类元认知研究的教训,质疑大型语言模型能否真正内省,并通过重新审视两个评估范式发现,当前证据不足以证明LLM具有元认知监控能力。

AI中文摘要

大型语言模型能否检测并报告自身的内部状态?许多研究认为答案是肯定的。我们基于人类元认知研究的教训认为,这一结论可能为时过早:要确信这一结论,我们需要区分真正的内省与基于表面线索的模式匹配。此外,我们认为仅凭行为证据本身不足以建立强有力的内省主张。 我们在此考虑下重新审视了两个最近引入的评估范式。在第一个范式中,模型需要检测其内部状态是否被篡改。我们发现,模型无法可靠地区分对其内部状态的干预与对输入的操纵,这表明它们在原始研究中的成功反映了它们更一般地检测异常的能力,而非特别针对其内部状态的干预。在我们检查的第二个范式中,模型被要求预测从其自身隐藏状态派生的标签。我们发现,仅能访问输入的分类器达到了与模型自身上下文预测相当的性能,这表明原始结果并未决定性地证明模型对其内部表示具有特权访问。我们进一步引入了一个重新标记的控制设置,其中模型不能依赖任务的语义来解决问题,而必须依赖内部表示;在这个更好控制的版本中,模型的表现更接近随机。综合这些结果,表明当前证据不足以证明LLM表现出元认知监控。

英文摘要

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

2605.26241 2026-05-27 cs.CV

RoMo: A Large-Scale, Richly Organized Dataset and Semantic Taxonomy for Human Motion Generation

RoMo:用于人体运动生成的大规模、丰富组织的数据集和语义分类体系

Jiahao Zhang, Joseph Liu, Young-Yoon Lee, Seonghyeon Moon, Victor Zordan, Guy Tevet, Karen Liu, Stephen Gould, Oren Jacob, Haomiao Jiang, Mubbasir Kapadia, Yizhak Ben-Shabat

发表机构 * Australian National University(澳大利亚国立大学) Roblox Stanford University(斯坦福大学) Rutgers University(罗格斯大学)

AI总结 提出RoMo数据集,通过分类感知过滤流水线确保质量,并采用三级语义分类体系组织数据,使训练模型在保真度和多样性上达到最优,同时提升对复杂文本提示的理解。

Comments Accepted to CVPR'26

AI中文摘要

在语言、图像和视频领域的生成建模成功表明,大型、精心策划的数据集是构建强大模型的关键驱动力。然而,3D人体运动领域一直滞后,受限于在小型高保真运动捕捉数据集和以静态或低质量序列为主的大规模野外数据集之间的不满意选择。我们引入了RoMo,一个丰富、大规模、精心策划的野外人体运动数据集,解决了这些权衡。为确保质量,我们引入了一个分类感知过滤流水线,积极去除静态和易产生伪影的序列。每个序列都带有详细注释,并由一个新颖的三级语义分类体系组织。这种层次结构实现了细粒度的逐类别评估,揭示了全局指标所掩盖的模型优势和弱点。我们证明,在RoMo上训练的模型在保真度和多样性上达到最优,同时获得了对复杂、细微文本提示的卓越理解。最后,我们发布了Motion Toolbox以标准化指标、数据转换和可视化,为可重复和可解释的运动生成研究奠定了基础。

英文摘要

Success in generative modeling across language, image, and video demonstrates that large, well-curated datasets are the key driver for building capable models. 3D human motion, however, has lagged behind, constrained by an unsatisfying choice between small, high-fidelity motion capture datasets and large-scale in-the-wild collections dominated by static or low-quality sequences. We introduce RoMo, a rich, large-scale, carefully curated dataset of in-the-wild human motions that resolves these tradeoffs. To ensure quality, we introduce a taxonomy-aware filtering pipeline that aggressively removes static and artifact-prone sequences. Every sequence is annotated with detailed captions and organized by a novel three-level semantic taxonomy. This hierarchical structure enables fine-grained, per-category evaluation that reveals model strengths and weaknesses obscured by global metrics. We demonstrate that models trained on RoMo achieve state-of-the-art fidelity and diversity while gaining a superior understanding of complex, subtle text prompts. Finally, we release the Motion Toolbox to standardize metrics, data conversion, and visualization, establishing a foundation for reproducible and interpretable motion generation research.

2605.26239 2026-05-27 cs.CV cs.MA

Sentinel: Embodied Cooperative Spatial Reasoning and Planning

Sentinel:具身协同空间推理与规划

Xiangye Lin, Hongxin Zhang, Ruxi Deng, Qinhong Zhou, Chuang Gan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出Sentinel挑战和CoSaR框架,通过自然语言通信与空间导航算法结合,解决多智能体在城市规模户外环境中的协同空间推理与规划问题。

Comments The first two authors contributed equally

AI中文摘要

在这项工作中,我们研究了协同空间智能,即分散的具身智能体在跨城市规模的户外领域中,在动态环境约束下有效协调的能力。我们引入了Sentinel挑战,这是一个基准测试,其中多个分散的具身智能体必须通过自然语言进行通信,以在大规模城市户外环境中就一个相互安全且方便的会合点达成一致。然后,每个智能体必须安全导航,同时避开巡逻的动态哨兵,并使用提供粗略空间信息的工具。为了解决这个问题,我们提出了CoSaR(协同空间推理与规划)框架,该框架将基础模型的高层通信和规划能力与经典空间导航算法的精度相结合。CoSaR使智能体能够交换情境更新、推理不断变化的空间约束,并协同重新规划轨迹。在14个城市级别场景(包含3-5个智能体)的评估中,CoSaR始终导致更快的聚集、更短的路径长度和更高的安全性。我们的结果表明,将动态通信与空间推理相结合对于鲁棒的多智能体协作至关重要。通过形式化这一新设置并提供可扩展的基准测试,我们旨在为推进具身多智能体系统中的协同空间智能奠定基础。代码和挑战可在https://github.com/UMass-Embodied-AGI/Sentinel获取。

英文摘要

In this work, we study Cooperative Spatial Intelligence, the ability of decentralized embodied agents to coordinate effectively under dynamic environmental constraints across city-scale outdoor domains. We introduce Sentinel Challenge, a benchmark where multiple decentralized embodied agents must communicate in natural language to agree on a mutually safe and convenient meeting point within large, city-scale outdoor environments. Each agent must then navigate safely while avoiding dynamic sentinels patrolling the area, using a tool that provides coarse spatial information. To address this, we propose CoSaR (Cooperative Spatial Reasoning and Planning), a framework that bridges the high-level communication and planning abilities of foundation models with the precision of classical spatial navigation algorithms. CoSaR enables agents to exchange situational updates, reason over evolving spatial constraints, and collaboratively replan trajectories. Evaluated across 14 city-level scenes with 3-5 agents, CoSaR consistently leads to faster gathering, shorter path lengths, and improved safety. Our results demonstrate that integrating dynamic communication with spatial reasoning is essential for robust multi-agent cooperation. By formalizing this new setting and providing a scalable benchmark, we aim to build a foundation for advancing cooperative spatial intelligence in embodied multi-agent systems. Code and challenge are available at https://github.com/UMass-Embodied-AGI/Sentinel.

2605.26232 2026-05-27 cs.CV

Not All Modalities Are Equal: Instruction-Aware Gating for Multimodal Videos

并非所有模态都平等:面向多模态视频的指令感知门控

Bonan Ding, Umair Nawaz, Ufaq Khan, Abdelrahman M. Shaker, Muhammad Haris Khan, Jiale Cao, Jin Xie, Fahad Shahbaz Khan

Affiliations * Mohamed bin Zayed University of Artificial Intelligence, Tianjin University, Chongqing University, Linköping University

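AI summary Proposes the UniMVU framework, which performs multimodal video understanding through instruction-aware dynamic gating at the inner-modality and modality levels, and outperforms static-fusion methods on six benchmarks.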

Comments 19 pages, 8 figures, 7 tables, preprint

Details
AI Chinese abstract

预训练视频大语言模型在视觉推理方面表现出色。然而,当视频伴随辅助流(如音频、深度图或密集时间证据)时,它们会陷入困境。在这种情况下,统一融合会导致模态干扰,使不相关的通道分散模型注意力。为了解决这个问题,我们提出了一个统一的多模态视频理解框架UniMVU,该框架通过两个级别的动态门控在视频、音频、深度图或任何其他模态输入之间执行指令感知融合:内模态门强调每个模态内的显著区域,而模态级门重新加权整个流;两者都根据文本指令进行条件化,以自适应地平衡模态重要性。我们的UniMVU将跨模态自注意力与指令驱动的内模态门控模块以及带有控制令牌的模态级门控模块相结合;对于时间对齐的流,我们进一步采用了一种快慢融合方案,以减少冗余。在六个基准(AVQA、AVSD、Music-AVQA、ScanQA、SQA3D和MVBench)上,我们的UniMVU相对于静态融合基线取得了一致的提升,在CIDEr指标上最高提升了13.5。此外,我们的分析表明,门控机制与人类可解释的模态相关性一致,消融实验显示了内模态和模态级门控的贡献。我们的UniMVU为指令感知的多模态视频理解提供了一种简单、统一的方案,无需手工设计的融合规则即可扩展到多种模态。

English abstract

Pre-trained video large language models excel at visual reasoning. However, they struggle when videos arrive with auxiliary streams, such as audio, depth maps, or dense temporal evidence. In such scenarios, uniform fusion induces modality interference, allowing irrelevant channels to distract the model. To address this issue, we present a unified multimodal video understanding framework, named UniMVU, that performs instruction-aware fusion across video, audio, depth-map, or any other modality inputs via two levels of dynamic gating: inner-modality gates emphasize salient regions within each modality, whereas modality-level gates re-weight whole streams; both are conditioned on the text instruction to adaptively balance modality importance. Our UniMVU combines cross-modal self-attention with an instruction-driven inner-modality gating module and a modality-level gating module with a control token; for time-aligned streams, we further adopt a fast-to-slow fusion scheme that reduces redundancy. Across six benchmarks (AVQA, AVSD, Music-AVQA, ScanQA, SQA3D and MVBench), our UniMVU achieves consistent gains over static-fusion baselines, with gains as high as 13.5 in CIDEr. Further, our analysis shows that the gating mechanism aligns with human-interpretable modality relevance, and ablations show the contributions of inner-modality and modality-level gating. Our UniMVU provides a simple, unified recipe for instruction-aware multimodal video understanding that scales to diverse modalities without hand-crafted fusion rules.
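
As a rough illustration of what a modality-level gate conditioned on the instruction could look like, the sketch below scores each stream against an instruction embedding and re-weights whole streams with a softmax. It is not UniMVU's architecture: the abstract gives no implementation details, so the dot-product scoring, the function names (`modality_gate`, `fuse`) and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def modality_gate(instruction_vec, modality_summaries):
    """Return one weight per modality, conditioned on the instruction.

    Assumption: relevance is the dot product between the instruction
    embedding and a pooled summary vector of each stream.
    """
    names = list(modality_summaries)
    scores = [dot(instruction_vec, modality_summaries[n]) for n in names]
    return dict(zip(names, softmax(scores)))


def fuse(instruction_vec, modality_summaries):
    """Re-weight whole streams by their gate and sum them."""
    weights = modality_gate(instruction_vec, modality_summaries)
    fused = [0.0] * len(instruction_vec)
    for name, w in weights.items():
        for i, v in enumerate(modality_summaries[name]):
            fused[i] += w * v
    return weights, fused


# Toy inputs (hypothetical): the instruction points along the "video" axis.
instruction = [1.0, 0.0]
streams = {"video": [1.0, 0.0], "audio": [0.0, 1.0], "depth": [0.0, 1.0]}
weights, fused = fuse(instruction, streams)
```

In this toy setup the video stream receives the largest weight because it is the only one aligned with the instruction, whereas a static-fusion baseline would weight all three streams equally regardless of the instruction.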

2605.26230 2026-05-27 cs.CV

Geometry-Aware Representation Denoising for Robust Multi-view 3D Reconstruction

几何感知表示去噪用于鲁棒的多视角3D重建

Jin Hyeon Kim, Jaeeun Lee, Claire Kim, Kyoungjin Oh, Paul Hyunbin Cho, Jaewon Min, Yeji Choi, Jihye Park, Hyunhee Park, Minkyu Park, Seungryong Kim

Affiliations * KAIST AI, Samsung Electronics

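AI summary Proposes the Geometry-Aware Representation Denoising (GARD) framework, which performs diffusion-based multi-view restoration in the feature space of a feed-forward 3D reconstruction model, recovering both scene geometry and high-quality RGB images; its effectiveness is validated on the DA3 benchmark.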

Details
AI Chinese abstract

多视角3D重建随着前馈3D重建模型的出现取得了显著进展。然而,这些模型通常在理想的无退化成像条件下训练和评估,而真实世界的观测往往包含与此类设置显著不同的退化。因此,在退化条件下提高多视角3D重建的鲁棒性仍然是一个重要挑战。我们提出了几何感知表示去噪(GARD),一种新颖的框架,直接在前馈3D重建模型的特征空间中执行基于扩散的多视角恢复。这种设计利用3D重建器的几何感知特征表示来有效恢复准确的场景几何。此外,通过使用额外的RGB图像解码器,精炼的表示还可用于恢复高质量的RGB图像,从而同时恢复3D场景几何和高质量图像。在Depth Anything 3(DA3)基准上的全面实验证明了所提出的GARD框架的有效性。

English abstract

Multi-view 3D reconstruction has achieved remarkable progress with the advent of feed-forward 3D reconstruction models. However, these models are typically trained and evaluated under ideal, degradation-free imaging conditions, whereas real-world observations often contain degradations that differ significantly from such settings. Improving robustness for multi-view 3D reconstruction under degraded conditions therefore remains an important challenge. We present Geometry-Aware Representation Denoising (GARD), a novel framework that performs diffusion-based multi-view restoration directly in the feature space of a feed-forward 3D reconstruction model. This design exploits the geometry-aware feature representations of the 3D reconstructor to effectively recover accurate scene geometry. Furthermore, by employing an additional RGB image decoder, the refined representations can also be used to restore high-quality RGB images, thereby enabling the simultaneous recovery of 3D scene geometry and high-quality imagery. Comprehensive experiments on the Depth Anything 3 (DA3) benchmark demonstrate the effectiveness of the proposed GARD framework.

2605.26222 2026-05-27 cs.LG stat.ML

From Privacy to Generalization: Linear Max-Information Bounds for DP-SGD

从隐私到泛化:DP-SGD的线性最大信息界

Christoph H. Lampert, Hossein Zakerinia

Affiliations * Institute of Science and Technology Austria (ISTA)

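AI summary Proves a finite-sample bound on the approximate max-information of DP-SGD that is at most linear in the dataset size, and derives from it a PAC-Bayes generalization bound and an explicit generalization bound for DP-SGD-trained models.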

Comments 22 pages

Details
AI Chinese abstract

理解泛化与隐私之间的关系仍然是现代机器学习理论中的一个核心挑战,特别是对于通过差分隐私随机梯度下降(DP-SGD)变体训练的深度网络。在这项工作中,我们通过证明DP-SGD的近似最大信息量的有限样本界,该界展现出与(Dwork et al, 2015)关于$ε$-差分隐私算法的经典结果相当的缩放性质,即最多与数据集大小成线性关系,从而在这个长期存在的开放问题上取得了进展。根据我们的结果,我们得到了一个通用的PAC-Bayes泛化界,其中所需的先验分布可以由DP-SGD学习,以及一个针对DP-SGD训练模型本身的泛化界,其复杂度项完全显式且由优化超参数控制。

English abstract

Understanding the relationship between generalization and privacy remains a central challenge in modern machine learning theory, particularly for deep networks trained by variants of differentially private stochastic gradient descent (DP-SGD). In this work, we make progress on this persistent open problem by proving a finite-sample bound on the approximate max-information of DP-SGD that exhibits scaling properties comparable with the classic result of Dwork et al. (2015) for $ε$-differentially private algorithms, namely at most linear in the dataset size. From our result we obtain a general-purpose PAC-Bayes generalization bound in which the necessary prior distribution can be learned by DP-SGD, as well as a generalization bound for DP-SGD-trained models themselves, with a complexity term that is fully explicit and controlled by the optimization hyperparameters.
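
For readers unfamiliar with the algorithm the bound applies to, the sketch below shows one update of DP-SGD in its commonly described form: clip each per-example gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, and average. It is a minimal pure-Python illustration, not the variant analysed in the paper; the function and parameter names (`dp_sgd_step`, `max_norm`, `noise_multiplier`) are assumptions, and nothing here computes the max-information or generalization bounds.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update on a batch of per-example gradients.

    The noise standard deviation is noise_multiplier * max_norm, applied to
    the sum of clipped gradients before dividing by the batch size.
    """
    batch = len(per_example_grads)
    clipped = [clip(g, max_norm) for g in per_example_grads]
    summed = [sum(col) for col in zip(*clipped)]
    std = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    return [p - lr * n / batch for p, n in zip(params, noisy)]


# Toy batch (hypothetical values); noise is disabled so the result is deterministic.
rng = random.Random(0)
grads = [[3.0, 4.0], [0.6, 0.8]]
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                         noise_multiplier=0.0, rng=rng)
```

With noise disabled, the first gradient (norm 5) is clipped down to norm 1 and the second is left unchanged, so the update moves the parameters to roughly `[-0.6, -0.8]`. The learning rate, clipping norm, noise multiplier and batch size are the kind of optimization hyperparameters that the abstract says control the complexity term.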