arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09426 2026-06-11 cs.AI Updated version

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan

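Affiliations: Zhejiang University; Microsoft Research Asia; Tsinghua University

AI summary: Proposes the WeaveBench benchmark, comprising 114 long-horizon hybrid-interface tasks across 8 real-world work domains that require agents to combine GUI and CLI/code operations. The best PassRate is only 41.2%, exposing shortcomings in existing evaluation.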

English abstract

Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestration under-tested. Thus, we introduce WeaveBench, a long-horizon hybrid-interface benchmark with 114 tasks across 8 real-world work domains, grounded in real user requests and publicly verifiable artifacts. Each task requires agents to combine GUI observations/actions with CLI/code operations within a single trajectory. We evaluate these tasks on a real Ubuntu desktop inside deployed CLI-agent runtimes, augmented with a minimal desktop-control plugin. We also propose a companion trajectory-aware judge that inspects deliverables, files, screenshots, logs, and action traces, while detecting shortcut behaviors such as fabricated visual evidence or hard-coded metrics. Across frontier model-runtime pairings, the best PassRate reaches only 41.2%, showing the benchmark remains far from saturated. The trajectory-aware judge further reveals that outcome-only grading substantially overestimates agent performance. Overall, WeaveBench exposes a critical gap in CUA evaluation and provides an effective testbed to measure whether agents can orchestrate GUI, CLI, and code operations across long-horizon real-world tasks.

2606.09347 2026-06-11 cs.CV Updated version

IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal

Haojun Guo, Fan Feng, Ziquan Wang, Yongsheng Zhang, Ying Yu

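Affiliations: Institute of Geospatial Information, Information Engineering University

AI summary: Proposes the IB-HFN network, which uses a dual-stream backbone, a Spatial Information Bottleneck Fusion module, and a joint optimization strategy to suppress SAR speckle noise while preserving optical detail, achieving high-fidelity cloud removal.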

English abstract

Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervision, which can propagate SAR speckle noise into optical reconstruction and lead to over-smoothed results. To address these limitations, we propose an Information Bottleneck-driven High-Fidelity Network (IB-HFN) for SAR-assisted optical cloud removal. IB-HFN employs a dual-stream backbone to preserve modality-specific representations before deep semantic fusion, thereby mitigating premature cross-modal contamination. At the fusion stage, we introduce a Spatial Information Bottleneck Fusion module that compresses SAR features through a channel-wise variational information bottleneck to suppress unstructured speckle noise. In parallel, a local-global gating mechanism predicts clear-sky regions and routes reliable optical details through a Dirac-initialized skip connection, decoupling noise suppression from texture preservation. We further develop a joint optimization strategy that integrates feature-level bottleneck regularization with image-level constraints on reconstruction accuracy, structural consistency, spectral fidelity, and contrastive sharpness. A dynamic weighting schedule balances these objectives to stabilize training and reduce hazy artifacts. Experiments on the SEN12MS-CR dataset under challenging spatio-temporal splits demonstrate that IB-HFN achieves superior structural preservation and spectral fidelity over existing methods.

2606.09289 2026-06-11 cs.LG Updated version

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

Yuesen Li, Daniel Link

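Affiliations: Technical University of Munich

AI summary: Proposes a framework based on a Temporal Graph Attention Network (T-GAN) that identifies in-possession football match phases from spatiotemporal tracking data, classifying three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases, with F1 scores of 0.87 at the intention level and 0.79 for scoring phases.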

Comments 27 pages, 10 figures

English abstract

Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.
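
The frame-level macro-average F1 mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard definition (the unweighted mean of per-class F1 scores); the label strings and toy frame sequences are invented for illustration and are not taken from the paper. The IoT-D metric is not reproduced here because the abstract does not define it.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy per-frame intention labels (illustrative only, not from the paper).
frames_true = ["invade", "invade", "keep", "keep", "score", "score"]
frames_pred = ["invade", "invade", "invade", "keep", "keep", "score"]
print(f"macro F1 = {macro_f1(frames_true, frames_pred):.3f}")
```

Because every class contributes equally to the mean, a rare phase that is recognised poorly pulls the macro average down as much as a frequent one would.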

2606.09287 2026-06-11 cs.LG Updated version

Trajectory Geometry of Transformer Representations Across Layers

Vishal Pandey, Gopal Singh, Yacine Mahdid

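Affiliations: MetriQual London, UK; Athens, GR

AI summary: Computes geometric metrics such as trajectory length and curvature and finds that semantically related prompts converge in middle-to-late layers, reasoning tasks produce greater curvature, and ambiguous tokens show trajectory bifurcation. Also reveals a three-phase structure.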

Comments 18 pages, 9 figures

English abstract

Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.
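
Two of the metrics named in the abstract, trajectory length and curvature, can be sketched for a single token whose hidden state is recorded at each layer. The definitions below are assumptions made for illustration (Euclidean path length, and the mean turning angle between consecutive layer-to-layer displacements); the paper's exact formulas may differ, and the function names are hypothetical.

```python
import math

def _sub(a, b):
    return [x - y for x, y in zip(a, b)]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def trajectory_length(states):
    """Sum of Euclidean distances between consecutive layer states."""
    return sum(_norm(_sub(b, a)) for a, b in zip(states, states[1:]))

def mean_turning_angle(states):
    """Mean angle (radians) between consecutive displacement vectors."""
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0.0 or nv == 0.0:
            continue  # a zero-length step has no direction
        cos = max(-1.0, min(1.0, _dot(u, v) / (nu * nv)))
        angles.append(math.acos(cos))
    return sum(angles) / len(angles) if angles else 0.0

# Toy 2-D "hidden states" across four layers (illustrative only).
layer_states = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
print(trajectory_length(layer_states), mean_turning_angle(layer_states))
```

In practice the states would be high-dimensional activations collected from each layer of the model; the toy 2-D path only makes the geometry easy to check by hand.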

2606.09105 2026-06-11 cs.AI Updated version

Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Xu Li, Hanzhe Tu, Xun Han

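Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes the Graph2Idea framework, which uses a knowledge graph to convert retrieved literature into structured triples, extracts graph-derived contexts, and applies a two-stage generation process to improve the novelty, quality, and feasibility of scientific ideas.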

English abstract

Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation. Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.
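
The idea of turning retrieved papers into triples and keeping only a target-centered context can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes triples are already extracted, treats the graph as undirected, and uses a hop limit as the (hypothetical) relevance criterion. The example triples are invented.

```python
from collections import defaultdict, deque

def target_context(triples, target, max_hops=2):
    """Keep the triples whose head and tail both lie within max_hops of target."""
    neighbours = defaultdict(set)
    for head, _, tail in triples:
        neighbours[head].add(tail)
        neighbours[tail].add(head)
    # Breadth-first search outward from the target entity.
    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return [t for t in triples if t[0] in distance and t[2] in distance]

# Invented (head, relation, tail) triples for illustration.
triples = [
    ("graph retrieval", "improves", "idea novelty"),
    ("knowledge triples", "encode", "cross-paper relations"),
    ("cross-paper relations", "support", "idea novelty"),
    ("speckle noise", "degrades", "cloud removal"),
]
for triple in target_context(triples, "idea novelty", max_hops=1):
    print(triple)
```

The resulting list is the kind of compact relational context that could be serialised into a prompt in place of flat abstracts; unrelated triples never reach the model.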

2606.08956 2026-06-11 cs.LG Updated version

From inverse problems to neural operators: prediction, mechanism, and generalization of data-driven models

Conor Rowan

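Affiliations: University of Colorado Boulder

AI summary: Unifies data-driven modeling strategies (inverse problems, sparse identification, neural ordinary differential equations, and neural operators) from a philosophical perspective, arguing that they differ only in the assumed model class of the input-output relation and that only certain models can discover mechanisms and thus generalize.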

English abstract

Scientists have historically relied on mathematical models based on differential equations to relate system inputs -- forces, fluxes, or heat sources -- to outputs, such as displacement, velocity, concentration, and temperature. These models rely on deep domain knowledge to determine the form of the governing differential equation, which is then calibrated with data by solving an inverse problem. In recent years, the field of Scientific Machine Learning has introduced a variety of alternative modeling strategies for physical systems. A method called Sparse Identification of Nonlinear Dynamics learns the governing equation as a sparse linear combination of terms in a user-defined library. Neural Ordinary Differential Equations construct the governing equation by taking in the state and its derivatives at the input layer of a neural network. Entirely foregoing the modeling framework of differential equations, neural operators directly learn a non-linear mapping between the system inputs and outputs. From inverse problems to neural operators, all of these modeling strategies can be conceptualized as data-driven machinery to predict a system's response over a range of inputs. It is then natural to wonder how exactly these various strategies relate to each other, and whether they can be neatly taxonomized. Drawing from the philosophical literature on scientific models, we argue that many model types have a common structure, differing only in the assumed model class of the input-output relation they define. Connecting to philosophical ideas on mechanism, and arguing that data from physical systems arises from solutions to parsimonious differential equations, we propose that only certain models are capable of mechanism discovery, and thus generalization. Our analysis is intended to unite apparently disparate modeling strategies and provide insight into their appropriate use cases.
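
The four strategies named in the abstract can be written side by side in generic notation, which makes the "same structure, different model class" claim easier to see. The symbols below (state $x$, input $u$, output $y$, parameters $\theta$, candidate library $\Theta$, coefficient matrix $\Xi$, operator $\mathcal{G}_\theta$) are assumed for illustration and are not taken from the paper; these are the commonly used textbook forms, with the neural ODE shown in its first-order form.

```latex
\begin{align}
\text{Inverse problem:}\quad
  & \dot{x} = f(x, u;\, \theta), \qquad
    \theta^\star = \arg\min_{\theta} \sum_{i} \bigl\| y_i - \hat{y}(u_i;\, \theta) \bigr\|^2 \\
\text{Sparse identification:}\quad
  & \dot{X} \approx \Theta(X)\, \Xi, \qquad \Xi \ \text{sparse} \\
\text{Neural ODE:}\quad
  & \dot{x} = \mathrm{NN}_{\theta}(x) \\
\text{Neural operator:}\quad
  & y = \mathcal{G}_{\theta}[u]
\end{align}
```

Reading down the list, the assumed model class grows from a known equation with a few free parameters, to a sparse combination of library terms, to an arbitrary learned right-hand side, to a learned input-output map with no differential equation at all.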

2606.08744 2026-06-11 cs.CV Updated version

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

Ayaan Choudhury, Preet Savalia, Anirudh Pydah, Avinash Sharma

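Affiliations: Indian Institute of Technology Jodhpur

AI summary: Proposes the MB-Loc framework, which projects LiDAR scans into a 2.5D multi-planar bird's-eye-view representation and combines a KL-regularized latent bottleneck with 3D spatial augmentations for lightweight, viewpoint-robust scene coordinate regression. On the NCLT dataset it runs at real-time inference speed and outperforms existing methods.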

English abstract

Global LiDAR localization is a fundamental task for autonomous navigation systems. Recent methods perform Scene Coordinate Regression (SCR) and achieve superior accuracy over Absolute Pose Regression (APR) solutions by predicting dense 3D world coordinates. However, SCR approaches introduce two major bottlenecks: severe computational inefficiency from processing raw 3D geometries and significant performance degradation under varying sensor viewpoints. To address these limitations, we present MB-Loc, a lightweight and viewpoint-robust SCR framework. Instead of relying on heavy 3D convolutions, we project the input LiDAR scan into a 2.5D Multi-planar Bird's-Eye View (BEV) representation. By slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes, MB-Loc retains essential 3D geometric structures while exploiting the computational tractability of standard 2D CNNs. To handle the inherent sparsity of outdoor LiDAR, we introduce a KL-regularized latent bottleneck that explicitly models spatial uncertainty without injecting stochastic noise. Finally, to ensure rotation robustness, we apply 3D spatial augmentations prior to planar projection, forcing the network to implicitly learn viewpoint-invariant features. We perform extensive experiments on the publicly available NCLT dataset and demonstrate that our proposed method outperforms the current state-of-the-art. Operating at real-time inference speeds, MB-Loc significantly outperforms traditional 3D-SCR architectures in computational efficiency.
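
The projection step described in the abstract (slice the point cloud along Z, then write signed depths into 2D planes) can be sketched in plain Python. The details are assumptions made for illustration: the signed depth is taken as the offset from each slice's mid-plane, a cell keeps the point closest to that mid-plane, and empty cells hold `None`. The function name, grid size, and slice edges are hypothetical.

```python
def multiplanar_bev(points, z_edges, xy_min, xy_max, grid):
    """Project (x, y, z) points into len(z_edges) - 1 square BEV planes."""
    n_planes = len(z_edges) - 1
    planes = [[[None] * grid for _ in range(grid)] for _ in range(n_planes)]
    cell = (xy_max - xy_min) / grid
    for x, y, z in points:
        if not (xy_min <= x < xy_max and xy_min <= y < xy_max):
            continue  # outside the BEV footprint
        k = next((i for i in range(n_planes)
                  if z_edges[i] <= z < z_edges[i + 1]), None)
        if k is None:
            continue  # outside every Z slice
        row = min(grid - 1, int((y - xy_min) / cell))
        col = min(grid - 1, int((x - xy_min) / cell))
        # Signed offset from the slice mid-plane (assumed definition).
        depth = z - 0.5 * (z_edges[k] + z_edges[k + 1])
        previous = planes[k][row][col]
        if previous is None or abs(depth) < abs(previous):
            planes[k][row][col] = depth
    return planes

# Toy scan: three points inside the footprint, one outside.
scan = [(0.5, 0.5, 0.2), (0.5, 0.5, 1.7), (3.5, 2.5, 0.9), (9.0, 9.0, 0.5)]
bev = multiplanar_bev(scan, z_edges=[0.0, 1.0, 2.0], xy_min=0.0, xy_max=4.0, grid=4)
print(bev[0][0][0], bev[1][0][0], bev[0][2][3])
```

Stacking the planes as image channels is what would let an ordinary 2D CNN consume the scan while some height information survives in the per-cell values.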

2606.08530 2026-06-11 cs.RO cs.AI Updated version

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

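Affiliations: Anhui University; University of Science and Technology of China; iFLYTEK

AI summary: Proposes the GEAR-VLA framework, which learns a unified geometry-aware action representation through coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization, enabling manipulation that generalizes across objects, backgrounds, and robots.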

English abstract

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.

2606.08415 2026-06-11 cs.CV cs.AI Updated version

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

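Affiliations: Nanjing University; Kuaishou Technology

AI summary: Proposes the CoVEBench benchmark, comprising 416 source videos and 626 multi-point editing instructions, with instruction compliance and fidelity judged by an MLLM. Shows that current models often omit edits or violate preservation constraints in compositional editing.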

Comments 34 pages, 11 figures, 9 tables

English abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.

2606.08343 2026-06-11 cs.LG Updated version

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

Jason Sulskis, Sathya Ravi

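Affiliations: University of Illinois at Chicago; Georgia Tech Research Institute

AI summary: Proposes GENERIC-FNO, the first neural operator to embed the full GENERIC structure of nonequilibrium thermodynamics directly in function space. Rank-one projections satisfy the degeneracy conditions exactly, yielding energy conservation and entropy production, with the structural guarantees preserved under super-resolution.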

Comments Under review at TMLR

English abstract

We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.
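
Why sandwiching an operator between rank-one projections enforces the degeneracy conditions can be checked in a small finite-dimensional analogue. This sketch is an assumption-laden toy: it uses dense 3x3 matrices instead of Fourier multipliers, fixed gradient vectors instead of learned functionals, and hypothetical names throughout. With $L = P_S A P_S$ ($A$ antisymmetric) and $M = P_E D P_E$ ($D$ symmetric positive semidefinite), $L \nabla S = 0$ and $M \nabla E = 0$ hold by construction.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, v):
    return [dot(row, v) for row in matrix]

def project_out(g, v):
    """Rank-one projection P_g v = v - g (g.v)/(g.g); it maps g itself to zero."""
    scale = dot(g, v) / dot(g, g)
    return [vi - scale * gi for vi, gi in zip(v, g)]

def apply_L(A, grad_S, v):
    """L = P_S A P_S with A antisymmetric, so L grad_S = 0 by construction."""
    return project_out(grad_S, matvec(A, project_out(grad_S, v)))

def apply_M(D, grad_E, v):
    """M = P_E D P_E with D symmetric PSD, so M grad_E = 0 by construction."""
    return project_out(grad_E, matvec(D, project_out(grad_E, v)))

# Hypothetical 3-dimensional example (all numbers are illustrative).
A = [[0.0, 1.0, -2.0], [-1.0, 0.0, 3.0], [2.0, -3.0, 0.0]]  # antisymmetric
D = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]     # diagonal, PSD
grad_E = [1.0, 2.0, 3.0]
grad_S = [2.0, -1.0, 1.0]

# GENERIC evolution: dx/dt = L grad_E + M grad_S
rate = [r + i for r, i in zip(apply_L(A, grad_S, grad_E),
                              apply_M(D, grad_E, grad_S))]
print("dE/dt =", dot(grad_E, rate))  # zero up to round-off
print("dS/dt =", dot(grad_S, rate))  # non-negative
```

Energy is conserved because the reversible part is antisymmetric and the dissipative part annihilates $\nabla E$; entropy cannot decrease because the reversible part annihilates $\nabla S$ and the dissipative part is positive semidefinite. No penalty term is involved, which mirrors the "by construction" claim in the abstract.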

2606.08102 2026-06-11 cs.RO cs.AI cs.MA Updated version

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

Affiliations: National Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China

AI summary: Proposes the Conquer framework, which uses a semantic skill library to coordinate multiple quadruped robots across continually learned tasks while avoiding catastrophic forgetting. The final average success rate is 95.6%.

Comments 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/

English abstract

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.

2606.08011 2026-06-11 cs.CL cs.AI Updated version

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura

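Affiliations: Institute of Science Tokyo; Preferred Networks Inc; Nara Institute of Science and Technology

AI summary: Proposes the RLSR framework, which trains a source rewriting model by reinforcement learning with translation-quality improvement as the reward and no per-MT-model prompt tuning. Across 6 MT models and 16 language pairs it outperforms the no-rewriting and same-scale prompting baselines and is comparable to a 235B-LLM prompting baseline.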

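AI abstract (translated from the Chinese)

Although directly prompting off-the-shelf large language models (LLMs) to produce meaning-preserving source rewrites can effectively improve machine translation (MT) quality, doing so requires manually tuning prompts for each MT model. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel reinforcement-learning framework that trains source rewriting models without per-MT-model prompt tuning. RLSR optimizes the rewriting model by directly using, as the reward, the improvement in downstream translation quality that each rewritten source yields. Extensive experiments across six MT models and 16 language pairs show that our 4B rewriting model trained with RLSR significantly outperforms both the no-rewriting baseline and existing prompt-based rewriting baselines of the same scale, while achieving competitive performance against a prompting baseline built on a 235B LLM.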

English abstract

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
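
The reward signal described in the abstract, the downstream translation-quality improvement produced by a rewrite, can be sketched as a difference of two quality scores. Everything concrete here is a stand-in: `toy_translate` and `toy_quality` are hypothetical placeholders for a real MT system and a real quality metric, and scoring against the original source (rather than a reference translation) is an assumption.

```python
def rewrite_reward(source, rewrite, translate, quality):
    """Quality gain from translating the rewrite instead of the original source."""
    baseline = quality(source, translate(source))
    improved = quality(source, translate(rewrite))
    return improved - baseline

# Toy stand-ins (hypothetical): an "MT system" that only knows a few words,
# and a "quality metric" that penalises untranslated tokens.
KNOWN = {"it", "is", "raining", "heavily"}

def toy_translate(text):
    return " ".join(w if w in KNOWN else "<unk>" for w in text.split())

def toy_quality(source, hypothesis):
    tokens = hypothesis.split()
    return 1.0 - tokens.count("<unk>") / len(tokens)

source = "it is raining cats and dogs"
rewrite = "it is raining heavily"
print(rewrite_reward(source, rewrite, toy_translate, toy_quality))
```

A rewrite that leaves translation quality unchanged earns zero reward and one that hurts it earns a negative reward, so the rewriting policy is pushed only toward edits that help the downstream MT system.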

2606.07909 2026-06-11 cs.AI cs.CL Updated version

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

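MemToolAgent overview: a simple restaurant-reservation scenario in which the agent retrieves similar memories, receives feedback about an invalid time format, and generates a reflection to update its memory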

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

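Affiliations: AWS AI; University of Washington

AI summary: Proposes the MemToolAgent framework, which improves the tool-use ability of LLM agents through memory management, with memory extraction and dynamic retrieval modules. It achieves relative improvements of 29%, 80%, and 17% on three benchmarks.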

Comments 8 pages, 5 figures

English abstract

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.
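
The third contribution, a retrieval module that decides how many memories to use from the similarity distribution, can be illustrated with one possible rule. The rule below (keep entries whose cosine similarity exceeds the mean plus `z` standard deviations) is an assumption chosen for illustration; the abstract does not specify the actual criterion. The function names, embeddings, and memory texts are hypothetical.

```python
import math
import statistics

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def adaptive_retrieve(query_vec, memory, z=1.0, max_k=5):
    """Return texts whose similarity exceeds mean + z * std of all similarities."""
    if not memory:
        return []
    scored = [(cosine(query_vec, vec), text) for text, vec in memory]
    sims = [s for s, _ in scored]
    cutoff = statistics.mean(sims) + z * statistics.pstdev(sims)
    kept = sorted((item for item in scored if item[0] >= cutoff), reverse=True)
    return [text for _, text in kept[:max_k]]

# Toy memory entries with 2-D "embeddings" (illustrative only).
memory = [
    ("Use 24-hour time format", [1.0, 0.1]),
    ("User prefers a window seat", [0.2, 1.0]),
    ("Party size must be an integer", [0.0, 1.0]),
    ("Cuisine preference: Thai", [-1.0, 0.2]),
]
print(adaptive_retrieve([1.0, 0.0], memory))
```

Unlike a fixed top-k, a distribution-based cutoff returns one entry when a single memory stands out and several when many are comparably relevant.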

2606.07591 2026-06-11 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出ResearchClawBench基准,包含10个领域40个任务,通过多模态评分标准评估自主科研能力,最强智能体仅得21.5分,揭示当前系统在实验协议、证据匹配和科学核心方面的不足。

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.07362 2026-06-11 cs.LG 版本更新

Breaking the Ice: Analyzing Cold Start Latency in vLLM

打破冰层:分析 vLLM 中的冷启动延迟

Huzaifa Shaaban Kabakibo, Animesh Trivedi, Lin Wang

AI总结 本文首次系统分析 vLLM 推理引擎的冷启动延迟,将其分解为六个基础步骤,发现主要受 CPU 限制,并建立轻量级分析模型预测延迟,为大规模推理环境资源规划提供指导。

详情
Journal ref
Proceedings of the 9th MLSys Conference, Bellevue, WA, USA, 2026
AI中文摘要

随着可扩展推理服务的普及,推理引擎的冷启动延迟变得重要。如今,vLLM 已成为许多推理工作负载的事实标准推理引擎。尽管流行,但由于其复杂性和快速演进,尚未有对其启动延迟的系统研究。随着 V1 API 和 torch.compile 等主要架构创新的引入,本文首次对 vLLM 启动延迟进行了详细的性能表征。我们将启动过程分解为六个基础步骤,并证明其主要受 CPU 限制。每个步骤在模型级和系统级参数方面表现出一致且可解释的缩放趋势,从而能够细粒度地归因延迟来源。基于这些见解,我们开发了一个轻量级分析模型,能够准确预测给定硬件配置下的 vLLM 启动延迟,为大规模推理环境中的资源规划提供可操作的指导。所有基准测试数据集、分析工具和预测脚本均在 https://github.com/upb-cn/vllm-startup-profiler 开源。

英文摘要

As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of torch.compile, this paper presents the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at https://github.com/upb-cn/vllm-startup-profiler.

2606.06921 2026-06-11 cs.SD 版本更新

Towards Event-Robust Acoustic Scene Classification

面向事件鲁棒的声学场景分类

Yiqiang Cai, Bohan Hu, Yu Yang, Pengwei Lu, Shengchen Li, Xi Shao

发表机构 * Xi'an Jiaotong-Liverpool University(西交利物浦大学) Zhongdian Zhiheng Information Technology Service Co., Ltd(中电智恒信息技术服务有限公司) China Telecom Jiangsu Branch(中国电信江苏分公司) Nanjing University of Posts and Telecommunications(南京邮电大学)

AI总结 针对现有声学场景分类系统在未知声音事件下性能下降的问题,提出事件移位声学场景数据集ESAS,通过大语言模型注入前景事件模拟真实环境,评估并推动事件鲁棒ASC研究。

Comments Accepted to Interspeech 2026. The ESAS dataset is available at: https://doi.org/10.5281/zenodo.20623264

详情
AI中文摘要

本文介绍了事件移位声学场景(ESAS)数据集,这是一个用于评估声学场景分类(ASC)系统对未知声音事件鲁棒性的新型基准。现有的ASC数据集通常包含干净且一致的音频记录,而现实环境往往包含多样且意外的事件。为弥合这一差距,ESAS通过借助大语言模型将前景声音事件注入背景场景来模拟现实世界的声学变化。本文介绍了构建方法、数据集统计和评估协议。此外,使用ESAS基准对最先进的ASC系统进行了全面评估。实验结果表明,现有的ASC模型在面对事件移位挑战时性能显著下降。ESAS数据集的引入旨在推动未来研究朝向事件鲁棒的ASC发展。

英文摘要

This paper introduces the Event-Shifted Acoustic Scene (ESAS) dataset, a novel benchmark for evaluating the robustness of Acoustic Scene Classification (ASC) systems against unknown sound events. Existing ASC datasets typically contain recordings of clean and consistent audio, while real-world environments often include diverse and unexpected sound events. To bridge this gap, ESAS simulates real-world acoustic variability by injecting foreground sound events into background scenes with the assistance of large language models. In this work, we present the construction methodology, dataset statistics, and evaluation protocols. Furthermore, a comprehensive evaluation of state-of-the-art ASC systems is conducted using the ESAS benchmark. Experimental results reveal that existing ASC models suffer significant performance degradation when facing the event-shift challenge. The introduction of the ESAS dataset aims to drive future research toward event-robust ASC.

2606.06904 2026-06-11 cs.RO cs.CV 版本更新

ActionMap: Robot Policy Learning via Voxel Action Heatmap

ActionMap: 基于体素动作热图的机器人策略学习

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore(新加坡国立大学Show实验室) NVIDIA(英伟达)

AI总结 提出ActionMap,一种将动作空间建模为体素热图的动作解码器,替代现有VLA模型中的单点预测器,在LIBERO仿真和真实Franka操作中提升性能和数据效率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在骨干网络、训练方法和数据规模方面快速发展,但将骨干网络隐藏状态转换为连续控制信号的动作解码器几乎没有变化,在大多数现有VLA中仍然是单点预测器。无论是通过自回归词元箱、L1回归还是流匹配去噪实现,所得解码器都将动作空间视为无结构的,在训练期间未利用相邻动作的几何邻近性。为了改进这一点,我们引入了ActionMap,一种体素热图动作头,可以插入现有VLA中替换其原生动作解码器。对于每个新动作,该头预测动作空间上的体素热图,其中每个体素直接存储对应动作的概率。在LIBERO仿真和真实Franka操作中,我们的热图头在匹配训练步数下超越了两种架构不同的骨干网络(例如,在LIBERO四套件平均上比OpenVLA-OFT的L1回归头高出8.2%),在两种骨干网络上以相当或更快的速度收敛,并且在低训练数据下保持显著更高的数据效率。跨骨干网络的一致性表明,动作表示是VLA性能的一个真正杠杆,与进一步的骨干网络或方法缩放不同。项目页面:https://showlab.github.io/ActionMap/

英文摘要

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://showlab.github.io/ActionMap/.
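Since each voxel stores the probability of its corresponding action, a continuous action still has to be read out of the heatmap. The sketch below shows one common readout, a soft-argmax that takes the probability-weighted mean of voxel centers. It is simplified to a 2-D action space, and the softmax normalization, the [-1, 1] action range, and the name `decode_action` are assumptions for illustration; the abstract does not say how ActionMap decodes its heatmap.

```python
import math
from typing import List, Tuple


def decode_action(logits: List[List[float]],
                  low: float = -1.0, high: float = 1.0) -> Tuple[float, float]:
    """Soft-argmax over a 2-D heatmap: expected voxel center under softmax."""
    rows, cols = len(logits), len(logits[0])
    peak = max(max(row) for row in logits)  # subtract the max for numerical stability
    weights = [[math.exp(v - peak) for v in row] for row in logits]
    total = sum(sum(row) for row in weights)

    def center(index: int, count: int) -> float:
        # Center of bin `index` when [low, high] is split into `count` equal bins.
        return low + (index + 0.5) * (high - low) / count

    x = sum(weights[r][c] * center(r, rows)
            for r in range(rows) for c in range(cols)) / total
    y = sum(weights[r][c] * center(c, cols)
            for r in range(rows) for c in range(cols)) / total
    return x, y


# A 4x4 heatmap with one dominant voxel at row 0, column 3 (made-up values).
heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[0][3] = 50.0
print(decode_action(heatmap))
```

Unlike a single-point regression target, a readout of this kind lets probability mass on neighboring voxels pull the decoded action toward them, which is one way the geometric proximity of actions can be exploited.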

2606.06065 2026-06-11 cs.CL cs.SD eess.AS 版本更新

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

多任务学习还不够:双输出第二语言语音识别中的表示纠缠

Seung Hwan Cho, Young-Min Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 针对双输出第二语言语音识别,研究发现多任务学习导致表面转录性能下降,归因于编码器级别的表示纠缠,尤其在英语中随表面-意义差异增大而加剧。

Comments 5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio

详情
AI中文摘要

第二语言(L2)语音识别通常需要发音转录和预期意义的转录。多任务学习(MTL)是一种自然的方法,因为它假设共享表示对两个输出都有益。然而,本文表明这一假设在韩语和英语中并不成立。MTL提高了意义转录但降低了表面转录,尤其是在英语中,性能下降与通过Levenshtein编辑距离测量的表面-意义差异成正比。编码器分析将这些模式与编码器级别的纠缠联系起来,韩语保留了不同的任务表示,而英语产生了几乎相同的表示。跨任务解码器分析表明,意义双输出解码器适应了独特的表示,而表面双输出解码器仍受编码器约束。这些发现促使设计能够减轻编码器级别纠缠的MTL框架,以减少双输出L2自动语音识别中的表面性能下降。

英文摘要

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.

2606.05922 2026-06-11 cs.AI cs.CL cs.LG 版本更新

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

暗中进化的智能体:基于自我偏好的回顾性Harness优化

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

发表机构 * City University of Hong Kong(香港城市大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 提出自监督方法RHO(回顾性harness优化),仅利用历史轨迹:重新求解其中的困难任务,并通过成对自我偏好选择harness更新,无需真实标签,在SWE-Bench Pro上通过单轮优化将通过率从59%提升至78%。

Comments Code: https://github.com/wbopan/retro-harness ; Project website: https://paper-rho.wenbo.io

详情
AI中文摘要

AI智能体依赖由技能、工具和工作流程组成的harness(运行框架)来解决复杂问题。持续改进这一harness对于适应新任务至关重要。然而,现有的优化方法通常需要带真实标签的验证集,而在实际部署场景中很难获取此类标注数据。为解决这一问题,我们提出回顾性harness优化(RHO),一种仅利用过去轨迹来优化智能体harness的自监督方法。具体而言,RHO从过去的轨迹中选择一个多样化的困难任务核心集,并并行重新求解。智能体通过自我验证和自我一致性分析这些rollout(重新求解得到的轨迹),然后生成候选harness更新,并通过自身的成对自我偏好选择最有效的更新。我们在三个不同领域(涵盖软件工程、技术工作和知识工作)上评估RHO。值得注意的是,单轮优化无需任何外部评分即可将SWE-Bench Pro上的通过率从59%提升至78%。此外,我们的分析表明RHO能有效针对先前的失败模式。因此,优化后的harness改变了智能体的行为模式,并在长周期会话中保持更高的准确性。

英文摘要

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

2606.05394 2026-06-11 cs.SD eess.AS 版本更新

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

nnAudio 2: 克服动态编译障碍与变换不一致性

Abhinaba Roy, Junyi Liang, Dorien Herremans

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 针对 nnAudio 在 TorchScript 编译、逆变换边缘情况和依赖漂移方面的问题,通过移除动态状态变异、限制逆变换适用范围并更新依赖,实现了与现代 PyTorch 和 SciPy 的兼容,提升了可微音频分析的鲁棒性。

详情
AI中文摘要

nnAudio 是一个用于深度学习的开源音频特征提取工具箱,但在当前环境中,其使用受到 TorchScript 不兼容、逆变换边缘情况和依赖漂移的阻碍。我们针对现代 PyTorch 和科学 Python 进行了有针对性的现代化改造。我们通过从脚本化代码路径中移除动态状态变异和模块构造,并收紧逆相关辅助函数中的参数处理,解决了 STFT 和 iSTFT 中的 TorchScript 编译失败问题。我们通过将可靠逆变换限制为均匀 bin 设置(freq_scale='no'),并对不支持的频率尺度引发显式运行时错误,澄清了逆 STFT 行为,防止了静默退化的重构。我们恢复了与现代 SciPy 的 CFP 兼容性,并确保当 gamma = 0 时 VQT 简化为 CQT。回归测试涵盖了新的 STFT/iSTFT 行为,更新后的代码库在现代 Python 环境中通过了完整的仓库测试套件。这些改进为研究和部署中的可微音频分析提供了更坚实的基础。

英文摘要

nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
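Why VQT should reduce to CQT at gamma = 0 follows from the usual variable-Q bandwidth definition. The notation below is the common textbook formulation and is an assumption here: f_k is the center frequency of bin k, B_k its bandwidth, and alpha a constant fixed by the number of bins per octave. The abstract does not spell out nnAudio's exact parameterization.

```latex
B_k = \alpha f_k + \gamma, \qquad
Q_k = \frac{f_k}{B_k} = \frac{f_k}{\alpha f_k + \gamma}
\quad\xrightarrow{\ \gamma = 0\ }\quad
Q_k = \frac{1}{\alpha}\ \text{for every } k .
```

With gamma > 0 the quality factor drops at low frequencies, which widens the low-frequency filters and shortens them in time. With gamma = 0 it is the same constant in every bin, which is the defining property of the CQT.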

2606.04694 2026-06-11 cs.CL 版本更新

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

DuDi: 跨语言动词化的双信号蒸馏

Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat

发表机构 * School of Information Science and Technology, VISTEC(信息科学与技术学院,VISTEC) AI Singapore(AI新加坡) MBZUAI

AI总结 提出DuDi框架,通过结合序列级和词元级信号以及跨语言动词化器,提升小语言模型在多语言(尤其是东南亚语言)上的性能。

详情
AI中文摘要

小语言模型(SLM)高效且可扩展,但其多语言能力在十亿以下规模时严重下降,尤其是对于东南亚(SEA)语言。我们引入了DuDi,一个双信号多语言蒸馏框架,它结合了在线序列级信号与离策略(off-policy)和同策略(on-policy)词元级信号。DuDi进一步使用跨语言动词化器来优化教师反馈并提高多语言设置下教师-学生的可迁移性。在SEA-HELM上跨多个模型家族、规模和教师-学生设置的实验表明,DuDi始终优于竞争性的蒸馏基线。消融和分析证实,序列级优化、词元级监督和跨语言动词化为多语言SLM提供了互补且可迁移的学习信号。

英文摘要

Small language models (SLMs) are efficient and scalable, but their multilingual capabilities degrade severely at sub-billion scales, especially for Southeast Asian (SEA) languages. We introduce DuDi, a dual-signal multilingual distillation framework that combines an online sequence-level signal with off-policy and on-policy token-level signals. DuDi further uses a cross-lingual verbalizer to refine teacher feedback and improve teacher-student transferability in multilingual settings. Experiments on SEA-HELM across multiple model families, scales, and teacher-student settings show that DuDi consistently outperforms competitive distillation baselines. Ablations and analyses confirm that sequence-level optimization, token-level supervision, and cross-lingual verbalization provide complementary and transferable learning signals for multilingual SLMs.

2606.04351 2026-06-11 cs.CV cs.CL 版本更新

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Frames2LoRA: 视觉-语言模型的参数化视频内化

Manan Suri, Sarvesh Baskar, Dinesh Manocha

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出Frames2LoRA方法,通过感知器超网络从视频编码中直接生成LoRA适配器,实现零视觉令牌的视频查询,在保持性能的同时大幅降低计算成本。

Comments https://frames2lora.github.io/

详情
AI中文摘要

在视觉-语言模型中处理视频成本高昂:每帧占用数百个令牌,推理成本随每帧和每次重复查询而增加。我们引入Frames2LoRA,一种参数化视频内化方法。感知器超网络逐层读取冻结VLM编码视频时产生的中间表示,并在单次前向传播中生成低秩适配(LoRA)适配器。与需要迭代梯度更新的标准LoRA微调不同,Frames2LoRA直接从视频预测这些权重。在SmolVLM2 500M和2.2B上针对视频摘要和描述进行训练后,Frames2LoRA使得相同的冻结VLM能够仅通过适配器回答查询,查询时上下文中的视觉令牌数为零。Frames2LoRA在两种模型规模的所有五个描述基准测试中,以及在八个视频问答基准测试-规模配对中的七个上,统计上非劣效且等同于直接视频上下文推理。尽管仅在12帧384px上训练,它在高达1024帧和1024px时仍保持稳定,而直接视频上下文推理通常会退化。在此扫描中,它将回答时的视觉令牌负载减少高达1500倍,查询TTFT减少6-80倍,同时保持视频忠实输出。我们还发现,为非重叠视频段独立生成的适配器可以在秩空间中组合,这为分块长视频内化提供了一条路径。

英文摘要

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
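One reading of "compose in rank space" is concatenation along the rank dimension: the sum of two low-rank updates B1·A1 + B2·A2 equals a single update [B1 | B2]·[A1 ; A2] of rank r1 + r2. The sketch below checks that identity with small pure-Python matrices. Whether Frames2LoRA composes segment adapters in exactly this way is an assumption; the helper names and the numeric values are made up.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def compose(b1: Matrix, a1: Matrix, b2: Matrix, a2: Matrix) -> Tuple[Matrix, Matrix]:
    """Stack two adapters along the rank axis: B = [B1 | B2], A = [A1 ; A2]."""
    b_cat = [r1 + r2 for r1, r2 in zip(b1, b2)]  # columns placed side by side
    a_cat = a1 + a2                              # rows stacked vertically
    return b_cat, a_cat


# Two rank-1 adapters for a 2x3 weight matrix (arbitrary values).
b1, a1 = [[1.0], [2.0]], [[1.0, 0.0, -1.0]]
b2, a2 = [[0.5], [-1.0]], [[2.0, 1.0, 0.0]]

b_cat, a_cat = compose(b1, a1, b2, a2)
delta_sum = add(matmul(b1, a1), matmul(b2, a2))  # apply both updates separately
delta_cat = matmul(b_cat, a_cat)                 # apply the single stacked adapter
print(delta_cat == delta_sum)
```

Under this reading, adapters for consecutive video chunks could be generated independently and stacked, at the cost of a rank that grows with the number of chunks.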

2606.03504 2026-06-11 cs.CL cs.AI 版本更新

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

BaltiVoice: 巴尔蒂语语音语料库与微调Whisper ASR系统

Muhammad Ali

发表机构 * Independent Researcher(独立研究员) The Islamia University of Bahawalpur(伊斯兰巴哈瓦尔普尔大学)

AI总结 针对无公开ASR资源的巴尔蒂语,构建16.8小时朗读语音语料库并微调Whisper-small模型,在538条话语的验证集上词错误率从159.19%降至26.74%。

Comments 6 pages, 3 figures, 4 tables. Code and data available at https://github.com/mohdali-dev/BaltiVoice-ASR

详情
AI中文摘要

我们提出了BaltiVoice,一个16.8小时的朗读语音语料库,用于巴尔蒂语(ISO 639-3: bft),这是一种在巴基斯坦吉尔吉特-巴尔蒂斯坦地区使用的藏语支语言,此前没有公开可用的ASR资源。该语料库包含10,060条经过验证的、以本族Nastaliq文字书写的话语,源自Mozilla Common Voice录音。微调OpenAI Whisper-small后,在包含538条话语、说话人不重叠的验证集上,词错误率(WER)为26.74%,字符错误率(CER)为8.67%,而零样本基线为159.19% WER和152.52% CER。在相同数据上微调的Whisper-base达到44.54% WER和15.61% CER,证实模型容量在这种低资源场景下很重要。该数据集、微调模型以及实时转录演示均在HuggingFace上公开提供。

英文摘要

We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. Fine-tuning OpenAI Whisper-small yields a Word Error Rate (WER) of 26.74% and a Character Error Rate (CER) of 8.67% on a 538-utterance speaker-disjoint validation set, down from a zero-shot baseline of 159.19% WER and 152.52% CER. A Whisper-base fine-tuned on the same data achieves 44.54% WER and 15.61% CER, confirming that model capacity matters for this low-resource setting. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
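A zero-shot WER above 100% is possible because WER divides the number of word-level edits by the reference length, and insertions are unbounded: a model that outputs far more words than the reference contains can accumulate more errors than there are reference words. The sketch below computes WER with a standard word-level edit distance. The function name and the example strings are made up for illustration and are not taken from the BaltiVoice evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c", "a b c"))    # 0.0
print(word_error_rate("a b", "x y z w v"))  # 2.5, i.e. 250% WER
```

In the second call the two reference words are both substituted and three extra words are inserted, giving five edits against a reference of length two.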

2606.03077 2026-06-11 cs.LG cs.AI cs.DC 版本更新

Libra: Efficient Resource Management for Agentic RL Post-Training

Libra:面向智能体强化学习后训练的高效资源管理

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) The Hang Seng University of Hong Kong (2018)(香港恒生大学)

AI总结 针对智能体强化学习中长尾、非平稳工作负载带来的资源管理挑战,提出Libra系统,通过周期性全局资源规划器和因果驱动多级反馈队列调度器,实现GPU分配优化和请求调度,最高提升3倍吞吐量和2.5倍收敛速度。

Comments 19 pages, 12 figures

详情
AI中文摘要

强化学习(RL)已成为将大型语言模型(LLM)塑造为有能力的智能体的标准后训练范式。在智能体RL中,rollout阶段在调用工具的同时生成轨迹,产生长尾且非平稳的工作负载,暴露出资源管理中的两个基本挑战。首先,由于长尾分布,一小部分轨迹主导了rollout的完成时间。其次,rollout和训练存在跨阶段不平衡,二者在计算模式、内存需求和对序列长度的敏感性上表现出强烈的不对称性。更严重的是,序列长度分布会随着策略的演变而持续漂移,使得任何静态资源划分逐渐变得次优。我们提出Libra,一个通过两个核心机制应对这两个挑战的资源管理系统。第一个是全局资源规划器,它联合优化rollout和训练集群间的GPU分配,并利用弹性混合池实现阶段间轻量级、非阻塞的工作节点重新分配。第二个是因果驱动的多级反馈队列(C-MLFQ)调度器,它基于从工具返回结果导出的因果信号(而非依赖脆弱的长度预测)将请求路由到异构的rollout桶。在48个A800 GPU上的评估表明,与基线相比,Libra实现了高达3.0倍的吞吐量提升和高达2.5倍的奖励收敛加速。

英文摘要

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.

2606.00995 2026-06-11 cs.AI 版本更新

Subliminal Learning Is Steering Vector Distillation

潜意识学习是引导向量蒸馏

Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy, Neel Nanda

发表机构 * Stanford University(斯坦福大学)

AI总结 本文发现潜意识学习通过单个引导向量实现,并证明这是引导向量蒸馏的特例,解释了非语义数据如何传递语义特征。

详情
AI中文摘要

潜意识学习指的是学生语言模型在教师输出上微调时获得教师的特征(例如,系统提示对猫头鹰的偏好),尽管输出与这些特征在语义上无关。目前尚不清楚没有语义意义的数据如何传递特定的语义特征。在这项工作中,我们表明潜意识学习是由单个引导向量介导的,即添加到模型激活中的向量。在两个开源模型上,我们发现教师的系统提示可以很好地近似为一个引导向量,而学生的行为是通过微调学习对齐向量驱动的。不能被引导向量很好近似的系统提示不会潜意识地学习。这是引导向量蒸馏的一个特例,其中在受引导教师输出上训练的学生学会模仿该引导。我们在一系列语义和随机向量上演示了引导向量蒸馏。向模型激活添加语义向量可以对其行为产生模型无关和模型特定(即非语义)的影响,因此非语义的生成数据可以传递具有语义效果的向量,从而实现潜意识学习。这也解释了为什么潜意识学习不能在模型之间转移。我们发现自适应优化器对于语言模型中的潜意识学习是必要的:引导数据上的激活梯度沿引导方向携带一个小但一致的分量,而非自适应优化器通过允许异常梯度主导来阻碍这一点。

英文摘要

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.

2606.00140 2026-06-11 cs.LG cs.AI 版本更新

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

整流流中对比速度匹配的几何擦除

Jonas Henry Grebe, Tobias Braun, Anna Rohrbach, Marcus Rohrbach

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出GEM框架,通过对比速度匹配实现整流流模型中的概念擦除,结合生成流网络与教师引导的流匹配,有效抑制有害内容生成。

详情
AI中文摘要

尽管多模态生成模型的快速采用提供了巨大潜力,但也增加了有害内容合成、深度伪造和版权侵权的风险。为应对这些挑战,概念擦除作为一种前瞻性防护手段应运而生。然而,随着该领域逐渐从基于U-Net的扩散模型转向整流流变换器,擦除研究难以跟上步伐。在这项工作中,我们引入了GEM,一个简单但高效的整流流模型擦除框架。作为我们贡献的一部分,我们在基于轨迹的遗忘(基于生成流网络)与经典教师引导擦除之间建立了原则性桥梁:我们将基于轨迹的信号转化为教师引导的流匹配设置,统一了两种范式的优势。具体而言,教师提供互补的吸引和排斥信号,我们将其组合成一个单一的几何引导目标,实现对不需要概念的目标抑制,同时保留良性生成。

英文摘要

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

2605.30437 2026-06-11 cs.CV 版本更新

Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement

通过结构细化减轻生成式AI图像编辑中的内容偏移和幻觉

Luxi Zhao, Michael S. Brown

发表机构 * Department of Electrical Engineering & Computer Science(电气工程与计算机科学系)

AI总结 提出一种后处理框架,通过建立粗空间和光度对应关系并融合输入图像与GenAI增强图像,在保留感知增强的同时抑制幻觉内容,从而解决黑盒GenAI图像编辑中的结构保持问题。

详情
AI中文摘要

生成式AI(GenAI)图像编辑器(如Nano Banana)在修图任务中产生视觉上令人满意的结果,使非专家能够仅通过文本提示编辑图像。然而,这些模型的生成性质常常引入空间错位、纹理失真和内容幻觉,这些都对需要像素级保真度的下游工作流程有害。我们为黑盒GenAI图像修图确定了一个称为“结构保持GenAI融合”的问题设置:在保持对原始输入图像的结构忠实性的同时,保留GenAI输出的感知增强。为了解决这个问题,我们提出了一种后处理框架,该框架首先建立粗空间和光度对应关系,然后执行融合阶段,将期望的增强转移同时抑制幻觉内容,从而将输入图像与其GenAI增强版本融合。在此设置中缺乏直接先前工作的情况下,我们针对来自真实感风格迁移和图像融合的代表性方法评估我们的框架。我们的实验表明,我们的方法在保持像素级结构一致性和输入分辨率的同时,更好地保留了美学质量。

英文摘要

Generative AI (GenAI) image editors, such as Nano Banana, produce visually compelling results for retouching tasks, enabling non-experts to edit images through text prompts alone. However, the generative nature of these models often introduces spatial misalignment, texture distortion, and content hallucination, all of which are detrimental to downstream workflows that require pixel-level fidelity. We identify a problem setting we call "structure-preserving GenAI fusion" for black-box GenAI image retouching: retain the perceptual enhancements of a GenAI output while enforcing structural faithfulness to the original input image. To address this problem, we propose a post-processing framework that fuses an input image with its GenAI-enhanced counterpart by first establishing coarse spatial and photometric correspondences, then performing a fusion stage that transfers desired enhancements while suppressing hallucinated content. In the absence of direct prior work in this setting, we evaluate our framework against representative methods from photorealistic style transfer and image fusion. Our experiments demonstrate that our method better preserves aesthetic quality while maintaining pixel-level structural consistency and the input resolution.

2605.29588 2026-06-11 cs.CV cs.AI q-bio.NC 版本更新

Brain-IT-VQA: From Brain Signals to Answers

Brain-IT-VQA: 从脑信号到答案

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani

发表机构 * Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 提出 Brain-IT-VQA 框架,基于 fMRI 脑信号解码语言令牌并结合语言模型进行视觉问答,在 NSD-VQA 新基准上显著优于先前方法,并用于分析脑区对视觉信息的贡献。

详情
AI中文摘要

从观看图像时记录的 fMRI 信号解码视觉内容,特别是回答关于所看图像的问题,是一个长期挑战。尽管近年来在基于 fMRI 的视觉问答(VQA)方面取得了显著进展,但性能仍然有限。此外,尽管最近的模型能够做出越来越准确的预测,但它们很少被用作理解大脑中视觉表征结构的工具。我们提出了 Brain-IT-VQA,一个基于 fMRI 的视觉问答框架。基于脑交互变换器(Brain-IT),我们的方法从脑活动中解码语言令牌,并将其与语言模型集成以回答视觉问题。我们的模型显著优于先前的基于 fMRI 的标题生成和 VQA 方法。我们进一步引入了 NSD-VQA,一个新的基于 fMRI 的视觉问答数据集和基准。与现有的图像-fMRI VQA 数据集通常每张图像只提供少数宽泛且弱控制的问题不同,NSD-VQA 在 20 个受控问题类别中平均每张图像提供 20 个问答对,这些类别解耦了多个层次的视觉理解。这使得在有限的 fMRI 测试数据下能够进行更可靠和可解释的评估。Brain-IT-VQA 和 NSD-VQA 共同提供了一个强大的预测框架和研究脑表征的工具。利用这个基准,我们量化了哪些形式的视觉和语义信息可以从对自然图像的 fMRI 响应中可靠解码。我们进一步分析了不同脑区在不同问题类型上的贡献。

英文摘要

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.

2605.29128 2026-06-11 cs.LG 版本更新

Apertus LLM Family Expansion via Distillation and Quantization

通过蒸馏和量化扩展 Apertus LLM 系列

Andrei Panferov, Davit Melikidze, Martin Jaggi, Dan Alistarh

AI总结 本文通过蒸馏和量化方法,基于 Apertus 8B 模型低成本扩展出参数高达 4B 的模型系列,覆盖多种硬件约束并保持强准确性。

详情
AI中文摘要

LLM 的广泛采用导致它们被用于各种应用和场景,例如聊天助手和数据标注,这要求模型满足特定的预算和硬件约束。这导致了 LLM 以批次发布,包含不同大小的相似模型,以便模型系列尽可能满足广泛的约束。在本文中,我们验证了蒸馏和量化作为将模型系列扩展到新大小和硬件格式的经济有效方法。基于开放配方 Apertus 8B LLM,我们生成了 Apertus-v1.1——一个蒸馏模型系列,参数高达 4B,在 1.7T 许可令牌上训练。我们证明了我们的方法在覆盖大范围的硬件和系统需求方面具有成本效益和强大的准确性性能。

英文摘要

The wide adoption of LLMs has led to their use in a great variety of applications and scenarios, such as chatbot assistants and data annotation, creating the need for models to satisfy certain budget and hardware constraints. This has led to the trend of LLMs being released in batches of similar models of various sizes, so that the model family covers as wide a range of constraints as possible. In this paper, we validate distillation and quantization as a cost-effective way to expand model families to new sizes and hardware formats. Based on the open-recipe Apertus 8B LLM, we produce Apertus-v1.1, a distilled family of models with up to 4B parameters trained on 1.7T permissive-license tokens. We demonstrate the cost-efficiency and strong accuracy of our approach in covering a wide range of hardware and system requirements.

2605.28882 2026-06-11 cs.CL cs.AI cs.SD Updated version

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

GrowLoop: 由人类种子驱动的自进化对话评估

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

Affiliations * Amap, Alibaba Group; The Chinese University of Hong Kong, Shenzhen

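AI summary To address three challenges in evaluating human-likeness in open-ended conversation (tacit knowledge, disagreement over criteria, and evolution over time), this work proposes GrowLoop, a self-evolving evaluation system that iteratively extracts evaluation rubrics from minimal human seed annotations via Heuristic Learning and uses a Rubric-Case co-evolution mechanism to keep adapting as models advance and scenarios shift.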

Details
AI Chinese abstract

随着大语言模型的快速发展,评估开放域对话中的类人性变得越来越重要。然而,类人性是一种隐性知识,人类可以直观感知,但其背后的标准难以明确表述。人类判断差异很大,在某些情况下高度一致,在其他情况下则存在合理分歧。同时,人类判断背后的标准仍然是隐性的,没有明确的基础来构建案例。此外,什么算作类人并非一成不变,而是随着模型能力和人类期望而演变。尽管在评估方法上取得了进展,如专家编写的基准、奖励模型和自进化基准,但没有一种方法能同时解决这三个挑战。因此,我们提出了GrowLoop,一个自进化的对话评估系统,能够随着模型进步和场景变化而持续适应。以最小的人工种子标注作为初始动力,LLM代理通过启发式学习迭代提取和细化评估标准。在标注者意见一致的地方要求人机一致,而在意见分歧的地方只要求合理性。此外,标准-案例协同进化机制实现了持续进化,当评估目标发生变化时,通过新的种子进行扩展。应用于开放域对话中的类人性评估,生成的标准不仅在与人判断的一致性上显著优于现有方法,而且还发现了标注者忽略的问题。由此产生的基准能够有效区分不同能力层级的模型,并揭示其不足之处,同时能够泛化到新场景并随着模型进步而适应。我们的工作将基准测试范式从手动更新或难度扩展转变为全面、持续的自我进化。

English abstract

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static but evolves with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, reward models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution: when the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.