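arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.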

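2606.09287 2026-06-11 cs.LG (updated version)

Trajectory Geometry of Transformer Representations Across Layers

Vishal Pandey, Gopal Singh, Yacine Mahdid

Affiliations: MetriQual; London, UK; Athens, GR

AI summary: Using geometric metrics such as trajectory length and curvature, the paper finds that semantically related prompts converge in middle-to-late layers, that reasoning tasks produce greater curvature, and that the trajectories of ambiguous tokens bifurcate; it also reveals a three-phase structure across layers.

Comments: 18 pages, 9 figures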

Abstract

Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.
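As a rough illustration of two of the five metrics, the sketch below computes trajectory length and a mean turning angle from a list of per-layer hidden-state vectors. The abstract does not give the paper's exact definitions, so the formulas here (summed Euclidean step lengths, and the angle between consecutive displacement vectors) are assumptions, and the function names are hypothetical.

```python
import math


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def trajectory_length(states):
    """Sum of Euclidean distances between consecutive layer states."""
    return sum(_norm(_sub(b, a)) for a, b in zip(states, states[1:]))


def mean_curvature(states):
    """Mean turning angle (radians) between consecutive displacement vectors."""
    steps = [_sub(b, a) for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        nu, nv = _norm(u), _norm(v)
        if nu == 0.0 or nv == 0.0:
            continue  # skip degenerate (zero-length) steps
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        # clamp against rounding error before taking the arccosine
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    return sum(angles) / len(angles) if angles else 0.0
```

Here `states` would hold one vector per layer for a single token position; a straight path gives zero curvature, while a right-angle turn gives pi/2 radians.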


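2606.09105 2026-06-11 cs.AI (updated version)

Graph2Idea: Retrieval-Augmented Scientific Idea Generation with Graph-Structured Contexts

Xu Li, Hanzhe Tu, Xun Han

Affiliations: Southwest Petroleum University; Sichuan Police College

AI summary: Proposes the Graph2Idea framework, which uses a knowledge graph to convert retrieved literature into structured triples, extracts graph-derived contexts, and improves the novelty, quality, and feasibility of generated scientific ideas through a two-stage generation process.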

Abstract

Generating novel, feasible, and high-quality research ideas is an important yet challenging task in scientific discovery. Recent Large Language Model (LLM)-based methods often ground idea generation with retrieved literature, but the retrieved evidence is usually provided as flat text, such as titles, abstracts, or summaries. Such flat contexts may contain redundant or weakly relevant information, while making cross-paper relations among problems, methods, mechanisms, and findings difficult to identify and trace. To address this challenge, we propose Graph2Idea, a knowledge graph-guided framework for retrieval-augmented scientific idea generation. Graph2Idea first retrieves papers according to the input topic, transforms them into structured knowledge triples, and dynamically constructs a target-centered knowledge graph to make literature relations explicit. It then extracts compact graph-derived contexts that retain target-relevant relational evidence while reducing noisy textual input. Based on these contexts, a two-stage generation process first identifies promising research directions and then guides the LLM to synthesize candidate ideas from graph-grounded evidence. Experiments on a scientific idea generation benchmark show that Graph2Idea outperforms representative baselines under the automatic evaluation protocol. Compared with the strongest baseline scores, it improves Novelty from 0.45 to 0.52, Quality from 0.24 to 0.29, and Feasibility from 0.22 to 0.28. These results suggest that graph-structured evidence helps LLMs generate research ideas through more explicit, compact, and traceable recombination of prior scientific knowledge.
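To make the idea of a target-centered, graph-derived context concrete, here is a minimal sketch that builds an adjacency map from (head, relation, tail) triples and keeps only the triples within a few hops of a target node. The abstract does not describe Graph2Idea's actual extraction procedure; the breadth-first cutoff, the function names, and the `max_hops` parameter are all assumptions for illustration.

```python
from collections import defaultdict, deque


def build_graph(triples):
    """Undirected adjacency map; each edge keeps its (head, relation, tail) triple."""
    adj = defaultdict(list)
    for head, rel, tail in triples:
        adj[head].append((tail, (head, rel, tail)))
        adj[tail].append((head, (head, rel, tail)))
    return adj


def target_context(triples, target, max_hops=2):
    """Return the triples reachable within max_hops of target, in BFS order."""
    adj = build_graph(triples)
    seen_nodes = {target}
    context = []
    queue = deque([(target, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor, triple in adj[node]:
            if triple not in context:
                context.append(triple)
            if neighbor not in seen_nodes:
                seen_nodes.add(neighbor)
                queue.append((neighbor, depth + 1))
    return context
```

Triples that are not connected to the target never enter the context, which is one simple way to cut weakly relevant text before prompting an LLM.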


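2606.08956 2026-06-11 cs.LG (updated version)

From inverse problems to neural operators: prediction, mechanism, and generalization of data-driven models

Conor Rowan

Affiliations: University of Colorado Boulder

AI summary: From a philosophical perspective, the paper unifies data-driven modeling strategies such as inverse problems, sparse identification, neural ordinary differential equations, and neural operators, argues that they differ only in the assumed model class of the input-output relation, and contends that only certain models can discover mechanisms and thus generalize.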

Abstract

Scientists have historically relied on mathematical models based on differential equations to relate system inputs -- forces, fluxes, or heat sources -- to outputs, such as displacement, velocity, concentration, and temperature. These models rely on deep domain knowledge to determine the form of the governing differential equation, which is then calibrated with data by solving an inverse problem. In recent years, the field of Scientific Machine Learning has introduced a variety of alternative modeling strategies for physical systems. A method called Sparse Identification of Nonlinear Dynamics learns the governing equation as a sparse linear combination of terms in a user-defined library. Neural Ordinary Differential Equations construct the governing equation by taking in the state and its derivatives at the input layer of a neural network. Entirely foregoing the modeling framework of differential equations, neural operators directly learn a non-linear mapping between the system inputs and outputs. From inverse problems to neural operators, all of these modeling strategies can be conceptualized as data-driven machinery to predict a system's response over a range of inputs. It is then natural to wonder how exactly these various strategies relate to each other, and whether they can be neatly taxonomized. Drawing from the philosophical literature on scientific models, we argue that many model types have a common structure, differing only in the assumed model class of the input-output relation they define. Connecting to philosophical ideas on mechanism, and arguing that data from physical systems arises from solutions to parsimonious differential equations, we propose that only certain models are capable of mechanism discovery, and thus generalization. Our analysis is intended to unite apparently disparate modeling strategies and provide insight into their appropriate use cases.


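2606.08744 2026-06-11 cs.CV (updated version)

MB-Loc: Multi-planar Bird's-eye-view Localization in outdoor LiDAR scenes

Ayaan Choudhury, Preet Savalia, Anirudh Pydah, Avinash Sharma

Affiliations: Indian Institute of Technology Jodhpur

AI summary: Proposes MB-Loc, a lightweight, viewpoint-robust scene coordinate regression framework for localization that projects LiDAR scans into a 2.5D multi-planar bird's-eye-view representation and combines a KL-regularized latent bottleneck with 3D spatial augmentations; on the NCLT dataset it runs at real-time inference speed and outperforms existing methods.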

Abstract

Global LiDAR localization is a fundamental task for autonomous navigation systems. Recent methods perform Scene Coordinate Regression (SCR) and achieve superior accuracy over Absolute Pose Regression (APR) solutions by predicting dense 3D world coordinates. However, SCR approaches introduce two major bottlenecks: severe computational inefficiency from processing raw 3D geometries and significant performance degradation under varying sensor viewpoints. To address these limitations, we present MB-Loc, a lightweight and viewpoint-robust SCR framework. Instead of relying on heavy 3D convolutions, we project the input LiDAR scan into a 2.5D Multi-planar Bird's-Eye View (BEV) representation. By slicing the point-cloud along the Z-axis and mapping signed depths into discrete 2D planes, MB-Loc retains essential 3D geometric structures while exploiting the computational tractability of standard 2D CNNs. To handle the inherent sparsity of outdoor LiDAR, we introduce a KL-regularized latent bottleneck that explicitly models spatial uncertainty without injecting stochastic noise. Finally, to ensure rotation robustness, we apply 3D spatial augmentations prior to planar projection, forcing the network to implicitly learn viewpoint-invariant features. We perform extensive experiments on the publicly available NCLT dataset and demonstrate that our proposed method outperforms the current state-of-the-art. Operating at real-time inference speeds, MB-Loc significantly outperforms traditional 3D-SCR architectures in computational efficiency.
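The slicing step can be pictured with a small sketch: points are binned into horizontal slices along Z, and each slice becomes a 2D grid whose cells store a signed depth. The abstract does not specify the grid extents, how collisions within a cell are resolved, or the reference for the signed depth, so the choices below (offset from the slice centre, keep the largest magnitude) and all parameter names are assumptions.

```python
def multiplanar_bev(points, xy_range=(-2.0, 2.0), z_range=(0.0, 2.0),
                    grid=4, planes=2):
    """Project (x, y, z) points into `planes` BEV grids of size grid x grid.

    Each cell stores the signed offset of z from its slice centre, keeping
    the offset with the largest magnitude; empty cells stay at 0.0.
    """
    lo, hi = xy_range
    z_lo, z_hi = z_range
    cell = (hi - lo) / grid
    thickness = (z_hi - z_lo) / planes
    bev = [[[0.0] * grid for _ in range(grid)] for _ in range(planes)]
    for x, y, z in points:
        if not (lo <= x < hi and lo <= y < hi and z_lo <= z < z_hi):
            continue  # drop points outside the region of interest
        # clamp indices so rounding at the upper edge cannot overflow
        k = min(planes - 1, int((z - z_lo) / thickness))
        i = min(grid - 1, int((y - lo) / cell))
        j = min(grid - 1, int((x - lo) / cell))
        offset = z - (z_lo + (k + 0.5) * thickness)
        if abs(offset) > abs(bev[k][i][j]):
            bev[k][i][j] = offset
    return bev
```

The result is a stack of 2D images that a standard 2D CNN could consume. One limitation of this toy encoding is that a point lying exactly on a slice centre is indistinguishable from an empty cell; a separate occupancy channel would be one way around that.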


2606.08530 2026-06-11 cs.RO cs.AI (updated version)

GEAR-VLA: Learning Geometry-Aware Action Representations for Generalizable Robotic Manipulation

Yuan Zhang, Shiqi Zhang, Yedong Shen, Shuai Dong, Jiajun Deng, Xin Zhang, Yuxuan Gao, Jiajia Wu, Xin Nie, Zhiyuan Cheng, Jianmin Ji, Yanyong Zhang, Xingyi Zhang, Jia Pan

Affiliations: Anhui University; University of Science and Technology of China; iFLYTEK

AI summary: Proposes the GEAR-VLA framework, which learns a unified geometry-aware action representation through coarse-to-fine action learning, semantic-aligned 3D integration, and embodiment canonicalization, enabling manipulation that generalizes across objects, backgrounds, and robots.

Abstract

Vision-Language-Action (VLA) models achieve strong benchmark performance but still struggle in real-world deployment with unseen objects, background shifts, and different robot embodiments. We argue that this stems from the lack of a unified geometry-aware manipulation representation, leaving existing VLAs vulnerable to low-level trajectory supervision, misaligned 3D features, and embodiment differences. To address this, we propose GEAR-VLA, a VLA framework for learning unified geometry-aware action representations for generalizable robotic manipulation. GEAR-VLA adopts coarse-to-fine action learning, where multi-source embodied pretraining equips the VLM with embodied reasoning and discrete action understanding before latent action tokens connect action semantics to a gradient-decoupled DiT continuous action expert. It further performs semantic-aligned 3D integration by aligning a trainable 3D spatial backbone with the VLA representation while freezing the original VLM-aligned visual pathway. To share this representation across robots, GEAR-VLA uses embodiment canonicalization, where embodiment-aware states and embodiment-invariant actions confine robot differences to the low-level interface. Extensive simulation and real-world experiments demonstrate strong generalization: GEAR-VLA achieves state-of-the-art performance on LIBERO, zero-shot LIBERO-Plus, and RoboTwin 2.0, reaches 85.9% success on AgileX and 81.0% on the pretraining-unseen LDT-01 embodiment, and obtains 90.1% success on a 6,360-trial universal grasping benchmark with 212 unseen objects. Code and models will be released at https://github.com/babynabeauty/GEAR-VLA.


2606.08415 2026-06-11 cs.CV cs.AI (updated version)

CoVEBench: Can Video Editing Models Handle Complex Instructions?

Jiangtao Wu, Jiaming Wang, Yiwen He, Yuanxing Zhang, Shihao Li, Dunyuan Liu, Xuedong Zhao, Jialu Chen, Zekun Moore Wang, Jiaheng Liu

Affiliations: Nanjing University; Kuaishou Technology

AI summary: Proposes the CoVEBench benchmark, comprising 416 source videos and 626 multi-point editing instructions, which uses MLLM judges to evaluate instruction compliance and fidelity, and shows that current models often omit edits or violate preservation constraints in compositional editing.

Comments: 34 pages, 11 figures, 9 tables

Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.


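2606.08343 2026-06-11 cs.LG (updated version)

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

Jason Sulskis, Sathya Ravi

Affiliations: University of Illinois at Chicago; Georgia Tech Research Institute

AI summary: Proposes GENERIC-FNO, the first neural operator to embed the full GENERIC structure of nonequilibrium thermodynamics directly in function space; rank-one projections satisfy the degeneracy conditions exactly, yielding energy conservation and entropy production, and the structural guarantees are preserved under super-resolution.

Comments: Under review at TMLR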

Abstract

We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.
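The "sandwiched between rank-one projections" construction can be checked in a finite-dimensional toy setting. In the sketch below, a skew-symmetric matrix `A` and a non-negative diagonal `D` stand in for the paper's Fourier multipliers (an assumption; the abstract gives no formulas), and `P_g = I - g g^T / (g^T g)` removes the component along a gradient `g`. With `L = P_S A P_S` and `M = P_E D P_E`, the degeneracy conditions `L grad_S = 0` and `M grad_E = 0` hold by construction, so `dE/dt = 0` and `dS/dt >= 0`. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def project_out(v, g):
    """Apply P_g = I - g g^T / (g^T g) to v, removing the component along g."""
    c = dot(g, v) / dot(g, g)
    return [x - c * y for x, y in zip(v, g)]


def apply_L(A, grad_S, v):
    """L v with L = P_S A P_S; L is skew-symmetric when A is."""
    return project_out(matvec(A, project_out(v, grad_S)), grad_S)


def apply_M(D, grad_E, v):
    """M v with M = P_E diag(D) P_E; M is symmetric PSD when all D >= 0."""
    w = project_out(v, grad_E)
    return project_out([d * x for d, x in zip(D, w)], grad_E)


def generic_rate(A, D, grad_E, grad_S):
    """dz/dt = L grad_E + M grad_S, the finite-dimensional GENERIC form."""
    reversible = apply_L(A, grad_S, grad_E)
    dissipative = apply_M(D, grad_E, grad_S)
    return [r + d for r, d in zip(reversible, dissipative)]
```

Because the projections are applied on both sides, the identities hold for any `A`, `D`, and gradients (up to floating-point rounding), which mirrors the abstract's claim that no penalty term is needed.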


2606.08102 2026-06-11 cs.RO cs.AI cs.MA (updated version)

Continual Quadruped Robots Coordination via Semantic Skill Discovery

Daoqing Wang, Yuchen Xiao, Weixuan Huang, Zhilong Zhang, Shenghua Wan, Meng Li, Lei Yuan, Yang Yu

Affiliations: National Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China; School of Artificial Intelligence, Nanjing University, Nanjing, China; Polixir Technologies, Nanjing, China

AI summary: Proposes Conquer, a framework that uses a semantic skill library to coordinate multiple quadruped robots across continually arriving tasks while avoiding catastrophic forgetting, reaching a final average success rate of 95.6%.

Comments: 22 pages, 8 figures, 11 tables. Project page: https://conquer-project.pages.dev/

Abstract

Multi-quadruped coordination has attracted increasing attention due to its enhanced payload capacity, broader contact coverage, and improved adaptability to challenging tasks. Existing methods for multi-quadruped manipulation typically focus on predefined or closed task families, often relying on multi-agent reinforcement learning (MARL) to train task-specific coordination policies. However, such methods struggle in open-ended continual learning settings, where tasks arrive sequentially and robots are expected to acquire new coordination skills while reusing previously learned ones without catastrophic forgetting. To address this challenge, we propose Conquer, a semantic skill-library framework that formulates continual multi-quadruped coordination as a retrieve-adapt-update process. First, to accommodate varying team sizes across tasks, we design a team-structured Self-Allies-Goal (SAG) backbone that supports variable-cardinality robot teams by explicitly modeling each robot's own state, teammate context, and task goal. For each incoming task, Conquer constructs a task-level semantic descriptor from pre-execution information and retrieves a relevant skill from the library for adaptation. After successful execution, Conquer updates the skill library by extracting trajectory-level semantic descriptors and organizing them according to semantic distance, thereby enabling continual skill accumulation and cross-task knowledge transfer. Simulation experiments show that Conquer achieves a final average success rate of 95.6%, demonstrating strong forward transfer and negligible catastrophic forgetting. Real-world rollouts on Unitree Go2 teams further validate the deployment feasibility of Conquer for practical multi-quadruped coordination. Simulation and real-robot demonstration videos are available at: https://conquer-project.pages.dev/.


2606.08011 2026-06-11 cs.CL cs.AI (updated version)

Rewrite to Translate, Translate to Reward: Reinforcement Learning for Source Rewriting in Machine Translation

Boxuan Lyu, Haiyue Song, Zhi Qu, Hidetaka Kamigaito, Kotaro Funakoshi, Manabu Okumura

Affiliations: Institute of Science Tokyo; Preferred Networks Inc; Nara Institute of Science and Technology

AI summary: Proposes RLSR, a framework that trains a source-rewriting model with reinforcement learning, using the improvement in translation quality as the reward, without tuning prompts for each MT model; across six MT models and 16 language pairs it outperforms the no-rewriting baseline and same-scale prompt-based baselines, and is competitive with a prompt-based baseline that uses a 235B LLM.

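AI-translated abstract

Although directly prompting off-the-shelf large language models (LLMs) to produce meaning-preserving source rewrites can effectively improve machine translation (MT) quality, doing so requires manually tuning prompts for each MT model. In this work, we propose RLSR (Reinforcement Learning for Source Rewriting), a novel reinforcement-learning framework that trains a source-rewriting model without per-MT-model prompt tuning. RLSR optimizes the rewriting model by using the downstream translation-quality improvement produced by each rewritten source directly as the reward. Extensive experiments across six MT models and 16 language pairs show that our 4B rewriting model trained with RLSR significantly outperforms the no-rewriting baseline and existing prompt-based rewriting baselines of the same scale, while remaining competitive with a prompt-based baseline that uses a 235B LLM.

Abstract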

Rewriting source text with large language models (LLMs) before translation has been shown to improve machine translation (MT) quality. However, we find that prompt-based rewriting can degrade translation quality rather than improve it, particularly when smaller LLMs, such as 4B-parameter models, are used. We argue that this limitation stems from the difficulty of controlling rewriting behavior through natural-language prompts alone: a rewrite is useful only if it improves downstream translation, yet existing prompt-based methods do not explicitly optimize for this signal. To address this issue, we propose RLSR (Reinforcement Learning for Source Rewriting), a reinforcement learning framework that trains the rewriting model with a reward based on the downstream translation-quality improvement produced by each rewrite. Experiments across six MT systems and 16 language pairs show that our 4B RLSR-trained rewriting models significantly outperform both the no-rewriting baseline and prompt-based rewriting baselines at the same model scale, while remaining competitive with baselines that use a 235B LLM.
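The reward signal described above can be written down in a few lines. This is only a sketch of the general idea (reward equals the quality gained by translating the rewrite instead of the original source); the abstract does not state which quality metric or RL algorithm RLSR uses, so `translate` and `quality` are hypothetical callables supplied by the caller.

```python
from typing import Callable


def rewrite_reward(source: str, rewrite: str,
                   translate: Callable[[str], str],
                   quality: Callable[[str], float]) -> float:
    """Reward = quality gain from translating the rewrite instead of the source.

    `translate` is a stand-in for a fixed MT system and `quality` for a
    translation-quality scorer; both are assumptions, not a real API.
    """
    baseline = quality(translate(source))
    improved = quality(translate(rewrite))
    return improved - baseline
```

A rewrite that leaves the translation unchanged earns zero reward, and one that makes the translation worse earns a negative reward, which is the behavior prompt-only rewriting cannot optimize for directly.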


2606.07909 2026-06-11 cs.AI cs.CL (updated version)

MemToolAgent: Leveraging Memory for Tool Using Agents Based on Environment and User Feedback

Suleyman Armagan Er, Danilo Ribeiro, Yogesh Virkar, Surafel Lakew, Adi Kalyanpur, James Gung, Thomas Delteil, Arshit Gupta

Affiliations: AWS AI; University of Washington

AI summary: Proposes MemToolAgent, a framework that improves the tool-use ability of LLM agents through memory management, with a memory extraction module and a dynamic retrieval module; it achieves relative improvements of 29%, 80%, and 17% on three benchmarks.

Comments: 8 pages, 5 figures

Abstract

Modern large language model (LLM) agents can use external tools to help users solve complex tasks. However, for problems that require learning from long-term historical events or from previous agent-environment interactions, LLM agents are required to use memory mechanisms to store and retrieve experiences. While sophisticated memory systems exist for dialogue agents, few studies have empirically examined how to improve agents' tool-using capabilities through past user-agent conversations. We propose MemToolAgent, a framework that improves tool use through memory management. Our approach contains a memory extraction module that processes past experiences into structured memory entries, and a retrieval module that dynamically selects a subset of the stored memory entries. This enables more personalized and accurate responses aligned with user preferences and feedback without requiring LLM fine-tuning. In summary, this work has three main contributions: (1) a unified memory entry format that improves both general-purpose and personalized tool use without LLM fine-tuning, (2) a reflection-based memory extraction that uses environment and user feedback to distill wrong executions into critiques to store, and (3) a retrieval module that chooses how many past experiences to use based on the memory similarity distribution. MemToolAgent achieves 29%, 80%, and 17% relative improvements compared to strong baselines on the WorkBench, NESTFUL, and PEToolBench benchmarks, respectively.
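One way to choose "how many past experiences to use" from the similarity distribution is to keep only the memories that stand out from the rest. The sketch below uses a mean-plus-one-standard-deviation threshold with a cap; this rule is an assumption made for illustration, since the abstract does not describe MemToolAgent's actual selection criterion, and the function name and `max_k` parameter are hypothetical.

```python
import statistics


def select_memories(similarities, max_k=5):
    """Return indices of memories scoring above mean + 1 std, best first."""
    if not similarities:
        return []
    threshold = statistics.fmean(similarities) + statistics.pstdev(similarities)
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    chosen = [i for i in ranked if similarities[i] > threshold]
    return chosen[:max_k]
```

With this rule the number of retrieved entries adapts to the query: a few clear outliers are returned when they exist, and nothing is returned when every stored memory is about equally (ir)relevant.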


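2606.07362 2026-06-11 cs.LG (updated version)

Breaking the Ice: Analyzing Cold Start Latency in vLLM

Huzaifa Shaaban Kabakibo, Animesh Trivedi, Lin Wang

Affiliations: Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country

AI summary: Presents the first systematic analysis of cold start latency in the vLLM inference engine, decomposing startup into six foundational steps, finding that it is predominantly CPU bound, and building a lightweight analytical model to predict latency, which offers guidance for resource planning in large-scale inference environments.

Journal ref: Proceedings of the 9th MLSys Conference, Bellevue, WA, USA, 2026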

Abstract

As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of torch.compile, this paper presents the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at https://github.com/upb-cn/vllm-startup-profiler.


2606.06921 2026-06-11 cs.SD (updated version)

Towards Event-Robust Acoustic Scene Classification

Yiqiang Cai, Bohan Hu, Yu Yang, Pengwei Lu, Shengchen Li, Xi Shao

Affiliations: Xi'an Jiaotong-Liverpool University; Zhongdian Zhiheng Information Technology Service Co., Ltd; China Telecom Jiangsu Branch; Nanjing University of Posts and Telecommunications

AI summary: To address the performance degradation of existing acoustic scene classification (ASC) systems under unseen sound events, the paper introduces ESAS, an event-shifted acoustic scenes dataset that uses a large language model to inject foreground events simulating real-world environments, in order to evaluate and advance research on event-robust ASC.

Comments: Accepted to Interspeech 2026. The ESAS dataset is available at: https://doi.org/10.5281/zenodo.20623264

2606.06904 2026-06-11 cs.RO cs.CV (updated version)

ActionMap: Robot Policy Learning via Voxel Action Heatmap

Pei Yang, Hai Ci, Yanzhe Chen, Qi Lv, Han Cai, Mike Zheng Shou

Affiliations: Show Lab, National University of Singapore; NVIDIA

AI summary: Proposes ActionMap, an action decoder that models the action space as a voxel heatmap, replacing the single-point predictor in existing VLA models and improving performance and data efficiency in LIBERO simulation and real-world Franka manipulation.

Abstract

Vision-language-action (VLA) models have advanced rapidly across backbones, training recipes, and data scale, yet the action decoder, which converts the backbone's hidden state into a continuous control signal, has barely changed and remains a single-point predictor across the majority of current VLAs. Whether implemented via autoregressive token bins, L1 regression, or flow-matching denoising, the resulting decoder treats the action space as unstructured, leaving the geometric proximity of neighboring actions unexploited during training. To advance this, we introduce ActionMap, a voxel heatmap action head that drops into an existing VLA in place of its native action decoder. For each new action, the head predicts a voxel heatmap over the action space, where each voxel directly stores the probability of the corresponding action. Across LIBERO simulation and real-world Franka manipulation, our heatmap head surpasses two architecturally distinct backbones at matched training steps (e.g., +8.2% over OpenVLA-OFT's L1 regression head on the LIBERO four-suite average), converges at comparable or faster rates on both backbones, and remains markedly more data-efficient at low training data. The cross-backbone consistency indicates that action representation is a real lever for VLA performance, distinct from further backbone or recipe scaling. Project Page: https://showlab.github.io/ActionMap/.
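To illustrate what a heatmap over the action space looks like at decode time, the sketch below turns per-voxel logits into a continuous action, either by taking the centre of the most probable voxel or by a probability-weighted mean of voxel centres. The abstract does not say how ActionMap discretizes the action space or reads out an action, so the uniform binning, both decode modes, and all names here are assumptions.

```python
import math


def voxel_centres(lo, hi, bins):
    """Centres of `bins` equal-width voxels on [lo, hi] along one axis."""
    width = (hi - lo) / bins
    return [lo + (k + 0.5) * width for k in range(bins)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_action(logits, centres, mode="argmax"):
    """Turn heatmap logits into one continuous action.

    `centres` holds one action vector per voxel, in the same order as `logits`.
    """
    probs = softmax(logits)
    if mode == "argmax":
        best = max(range(len(probs)), key=probs.__getitem__)
        return list(centres[best])
    # expectation ("soft-argmax"): probability-weighted mean of voxel centres
    dim = len(centres[0])
    return [sum(p * c[d] for p, c in zip(probs, centres)) for d in range(dim)]
```

Unlike a single-point regression head, the heatmap keeps probability mass on neighbouring voxels, so nearby actions can share credit during training.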


2606.06065 2026-06-11 cs.CL cs.SD eess.AS (updated version)

Multi-task Learning is Not Enough: Representational Entanglement in Dual-output Second Language Speech Recognition

Seung Hwan Cho, Young-Min Kim

Affiliations: KAIST

AI summary: For dual-output second-language speech recognition, the study finds that multi-task learning degrades surface transcription performance, attributes this to encoder-level representational entanglement, and shows that the degradation in English grows with surface-meaning divergence.

Comments: 5 pages, 2 figures, Accepted to the 43rd International Conference on Machine Learning Workshop on Machine Learning for Audio

Abstract

Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. Encoder analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
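The divergence measure named above, Levenshtein edit distance, is the standard dynamic-programming count of insertions, deletions, and substitutions. The sketch below is a generic implementation that works on strings or token lists; whether the paper computes it over characters, phonemes, or words is not stated in the abstract.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (unit-cost insert/delete/substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]
```

For a dual-output recognizer, `a` could be the surface (as-pronounced) transcription and `b` the intended-meaning transcription, so a larger distance means the two targets disagree more.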


2606.05922 2026-06-11 cs.AI cs.CL cs.LG (updated version)

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

Affiliations: City University of Hong Kong; Microsoft Research Asia

AI summary: Proposes RHO, a self-supervised method that optimizes the agent harness using rollouts over past trajectories and self-preference selection, without ground-truth labels; a single optimization round raises the pass rate on SWE-Bench Pro from 59% to 78%.

Comments: Code: https://github.com/wbopan/retro-harness ; Project website: https://paper-rho.wenbo.io

Abstract

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.


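2606.05394 2026-06-11 cs.SD eess.AS (updated version)

nnAudio 2: Overcoming Dynamic Compilation Barriers and Transform Inconsistencies

Abhinaba Roy, Junyi Liang, Dorien Herremans

Affiliations: Singapore University of Technology and Design

AI summary: Addresses nnAudio's problems with TorchScript compilation, inverse-transform edge cases, and dependency drift by removing dynamic state mutation, restricting the supported scope of inversion, and updating dependencies, restoring compatibility with modern PyTorch and SciPy and making differentiable audio analysis more robust.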

Abstract

nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
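The statement that VQT reduces to CQT when gamma = 0 can be illustrated with the usual variable-Q bandwidth rule, in which bin k has bandwidth alpha * f_k + gamma. The sketch below is not nnAudio code: the function name, the particular choice of alpha, and the parameter values are assumptions, and it only computes Q factors rather than a transform.

```python
def vqt_q_factors(fmin, n_bins, bins_per_octave, gamma):
    """Q factor (center frequency / bandwidth) of each geometrically spaced bin."""
    # Relative bandwidth between adjacent bins; one common choice, assumed here.
    alpha = 2 ** (1 / bins_per_octave) - 1
    freqs = [fmin * 2 ** (k / bins_per_octave) for k in range(n_bins)]
    return [f / (alpha * f + gamma) for f in freqs]

# gamma = 0: every bin has the same Q, which is the constant-Q (CQT) case.
q_constant = vqt_q_factors(fmin=32.7, n_bins=24, bins_per_octave=12, gamma=0.0)

# gamma > 0: low-frequency bins get wider bandwidths, so Q rises with frequency.
q_variable = vqt_q_factors(fmin=32.7, n_bins=24, bins_per_octave=12, gamma=10.0)
```

With gamma set to zero the bandwidth is strictly proportional to the center frequency, so the ratio is 1 / alpha for every bin.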


2606.04694 2026-06-11 cs.CL (updated version)

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

Patomporn Payoungkhamdee, Tinnakit Udsa, Jian Gang Ngui, Sarana Nutanong, Alham Fikri Aji, Peerat Limkonchotiwat

Affiliations: School of Information Science and Technology, VISTEC; AI Singapore; MBZUAI

AI summary: Proposes the DuDi framework, which combines sequence-level and token-level signals with a cross-lingual verbalizer to improve small language models on multilingual tasks, especially for Southeast Asian languages.

2606.04351 2026-06-11 cs.CV cs.CL (updated version)

Frames2LoRA: Parametric Video Internalization for Vision-Language Models

Manan Suri, Sarvesh Baskar, Dinesh Manocha

Affiliations: University of Maryland, College Park

AI summary: Proposes Frames2LoRA, which uses a perceiver hypernetwork to generate a LoRA adapter directly from video encodings, enabling video queries with zero visual tokens and sharply reducing compute cost while maintaining performance.

Comments: https://frames2lora.github.io/

Abstract

Processing video in vision-language models is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query TTFT by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
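The closing remark about adapters composing in rank space can be read in more than one way. One natural interpretation, assumed here and not confirmed by the abstract, is concatenation along the rank dimension: for LoRA updates of the form B·A, the sum B1·A1 + B2·A2 equals [B1 | B2]·[A1 ; A2]. The sketch below checks that identity with small integer matrices in plain Python; all names and values are illustrative.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def compose_rank_space(adapters):
    """Concatenate (B, A) pairs along the rank dimension.

    Each B is d x r_i and each A is r_i x k; the result is a single
    adapter of rank sum(r_i) whose product equals the sum of the parts.
    """
    n_rows = len(adapters[0][0])
    B_cat = [sum((B[i] for B, _ in adapters), []) for i in range(n_rows)]
    A_cat = [row for _, A in adapters for row in A]
    return B_cat, A_cat

# Two hypothetical adapters for non-overlapping segments (ranks 1 and 2).
B1, A1 = [[1], [2]], [[3, 4, 5]]
B2, A2 = [[0, 1], [1, 0]], [[1, 0, 0], [0, 1, 0]]

B_cat, A_cat = compose_rank_space([(B1, A1), (B2, A2)])
summed_update = add(matmul(B1, A1), matmul(B2, A2))
composed_update = matmul(B_cat, A_cat)
```

Under this reading the composed adapter's rank grows with the number of segments, which is a cost to keep in mind for long videos.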


2606.03504 2026-06-11 cs.CL cs.AI (updated version)

BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

Muhammad Ali

Affiliations: Independent Researcher; The Islamia University of Bahawalpur

AI summary: For Balti, a language with no public ASR resources, builds a 16.8-hour read-speech corpus and fine-tunes the Whisper-small model, reducing word error rate on the validation set from 182.18% to 30.07%.

Comments: 6 pages, 3 figures, 4 tables. Code and data available at https://github.com/mohdali-dev/BaltiVoice-ASR

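2606.03077 2026-06-11 cs.LG cs.AI cs.DC (updated version)

Libra: Efficient Resource Management for Agentic RL Post-Training

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

Affiliations: The Chinese University of Hong Kong; The Hang Seng University of Hong Kong (2018)

AI summary: To address the resource-management challenges posed by long-tailed, non-stationary workloads in agentic RL, proposes Libra, which optimizes GPU allocation and request scheduling through a global resource planner and a causality-driven multi-level feedback queue scheduler, achieving up to 3x higher throughput and up to 2.5x faster convergence.

Comments: 19 pages, 12 figures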

Abstract

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.


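2606.00995 2026-06-11 cs.AI (updated version)

Subliminal Learning Is Steering Vector Distillation

Camila Blank, Agam Bhatia, Senthooran Rajamanoharan, Arthur Conmy, Neel Nanda

Affiliations: Stanford University

AI summary: Finds that subliminal learning is mediated by a single steering vector and shows that it is a special case of steering vector distillation, explaining how non-semantic data can transmit semantic traits.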

Abstract

Subliminal learning refers to a student language model acquiring a teacher's traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher's outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model's activations. Across two open-source models, we find that the teacher's system prompt is well approximated by a steering vector, and that the student's behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model's activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
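The abstract defines a steering vector as a vector added to the model's activations and speaks of the student learning an aligned vector. The sketch below shows both operations on toy data in plain Python. It is only an illustration: the list-of-lists activation format, the `scale` parameter, and the use of cosine similarity as the alignment measure are assumptions, not details from the paper.

```python
import math

def steer(activations, vector, scale=1.0):
    """Add scale * vector to the activation at every token position."""
    return [[h + scale * v for h, v in zip(row, vector)] for row in activations]

def cosine(u, v):
    """Cosine similarity, used here as a simple alignment measure."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy activations: two token positions, hidden size 2.
hidden = [[1.0, 2.0], [0.0, -1.0]]
teacher_vector = [0.5, -0.5]
steered = steer(hidden, teacher_vector, scale=2.0)

# A hypothetical vector recovered from the student after fine-tuning.
student_vector = [1.0, -1.0]
alignment = cosine(teacher_vector, student_vector)
```

In this toy case the student vector is a scaled copy of the teacher's, so the two are perfectly aligned in direction even though their norms differ.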


2606.00140 2026-06-11 cs.LG cs.AI (updated version)

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

Jonas Henry Grebe, Tobias Braun, Anna Rohrbach, Marcus Rohrbach

Affiliations: University of California, Berkeley; ETH Zurich

AI summary: Proposes the GEM framework, which performs concept erasure in rectified flow models via contrastive velocity matching, combining Generative Flow Networks with teacher-guided flow matching to suppress harmful content generation.

Abstract

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.


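2605.30437 2026-06-11 cs.CV (updated version)

Mitigating Content Shift and Hallucination in GenAI Image Editing via Structural Refinement

Luxi Zhao, Michael S. Brown

Affiliations: Department of Electrical Engineering & Computer Science

AI summary: Proposes a post-processing framework that establishes coarse spatial and photometric correspondences and fuses the input image with its GenAI-enhanced version, retaining perceptual enhancements while suppressing hallucinated content, to address structure preservation in black-box GenAI image editing.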

Abstract

Generative AI (GenAI) image editors, such as Nano Banana, produce visually compelling results for retouching tasks, enabling non-experts to edit images through text prompts alone. However, the generative nature of these models often introduces spatial misalignment, texture distortion, and content hallucination, all of which are detrimental to downstream workflows that require pixel-level fidelity. We identify a problem setting we call "structure-preserving GenAI fusion" for black-box GenAI image retouching: retain the perceptual enhancements of a GenAI output while enforcing structural faithfulness to the original input image. To address this problem, we propose a post-processing framework that fuses an input image with its GenAI-enhanced counterpart by first establishing coarse spatial and photometric correspondences, then performing a fusion stage that transfers desired enhancements while suppressing hallucinated content. In the absence of direct prior work in this setting, we evaluate our framework against representative methods from photorealistic style transfer and image fusion. Our experiments demonstrate that our method better preserves aesthetic quality while maintaining pixel-level structural consistency and the input resolution.


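2605.29588 2026-06-11 cs.CV cs.AI q-bio.NC (updated version)

Brain-IT-VQA: From Brain Signals to Answers

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman, Michal Irani

Affiliations: Weizmann Institute of Science

AI summary: Proposes Brain-IT-VQA, a framework that decodes language tokens from fMRI brain signals and combines them with a language model for visual question answering; it substantially outperforms prior methods on the new NSD-VQA benchmark and is used to analyze how different brain regions contribute to visual information.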

Abstract

Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.


2605.29128 2026-06-11 cs.LG (updated version)

Apertus LLM Family Expansion via Distillation and Quantization

Andrei Panferov, Davit Melikidze, Martin Jaggi, Dan Alistarh

AI summary: Uses distillation and quantization to extend the Apertus 8B model at low cost into a family of models with up to 4B parameters, covering a range of hardware constraints while retaining strong accuracy.

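2605.28882 2026-06-11 cs.CL cs.AI cs.SD (updated version)

GrowLoop: Self-Evolving Conversation Evaluation Seeded by Human

Yihang Lin, Yunze Gao, Zeyang Lin, Dongbo Li, Kun Peng, Yue Liu

Affiliations: Amap, Alibaba Group; The Chinese University of Hong Kong, Shenzhen

AI summary: Targeting three challenges in evaluating human-likeness in open-ended conversation (tacit knowledge, disagreement over criteria, and continual evolution), proposes GrowLoop, a self-evolving evaluation system that iteratively extracts evaluation rubrics from minimal human seed annotations via Heuristic Learning and keeps adapting to model progress and scenario shifts through a rubric-case co-evolution mechanism.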

Abstract

With the rapid advancement of large language models, evaluating human-likeness in open-ended conversation has become increasingly important. However, human-likeness is a form of tacit knowledge that humans perceive intuitively, yet the underlying criteria resist explicit formulation. Human judgments vary widely, with strong agreement on some cases and legitimate disagreement on others. Meanwhile, the criteria behind human judgments remain implicit, leaving no clear basis for constructing cases. Further, what counts as human-likeness is not static, but evolving with model capability and human expectations. Despite progress in evaluation methods such as expert-authored benchmarks, Reward Models, and self-evolving benchmarks, none addresses all three challenges simultaneously. Therefore, we propose GrowLoop, a self-evolving conversation evaluation system that continuously adapts as models advance and scenarios shift. Starting from minimal human seed annotations, LLM agents iteratively extract and refine evaluation rubrics through Heuristic Learning. Human-AI agreement is required where annotators converge, while only plausibility is expected where they diverge. Moreover, the Rubric-Case co-evolution mechanism enables continuous evolution. When the evaluation target shifts, new human seeds expand the system's coverage accordingly. When applied to human-likeness evaluation in open-ended conversation, the AI judge guided by these rubrics not only substantially outperforms existing methods in alignment with human judgments, but also uncovers issues that annotators overlook. The resulting benchmark effectively discriminates models across capability tiers and reveals where they fall short, while generalizing to new scenarios and adapting as models advance. Our work shifts the benchmarking paradigm from manual updates or difficulty scaling to comprehensive, continuous self-evolution.


2605.25820 2026-06-11 cs.LG (updated version)

Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

Yulin Yuan, Hongshuo Zhao, Xiangming Meng

Affiliations: Zhejiang University; ZJUI-UIUC Institute

AI summary: To address visual redundancy in parallel decoding for diffusion-based multimodal large language models, proposes the Visual Redundancy Index (VRI) and a training-free Visual-Redundancy-Controlled Decoding (VRCD) method that uses token-to-image attention to prioritize visually complementary positions, improving accuracy on multiple benchmarks.

Comments: 18 pages, 5 figures, preprint. Code is available at https://github.com/infiniteYuanyl/VRCD

Abstract

Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code is available at https://github.com/infiniteYuanyl/VRCD.
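The contrast between committing the top-K positions by confidence and preferring visually complementary positions can be sketched as a greedy selection with a redundancy penalty. This is a hypothetical illustration, not the VRCD algorithm or the paper's VRI formula: the overlap measure (histogram intersection of attention distributions), the penalty weight `lam`, and every function name below are assumptions.

```python
from itertools import combinations

def overlap(p, q):
    """Histogram intersection of two normalized token-to-image attention maps."""
    return sum(min(a, b) for a, b in zip(p, q))

def redundancy_index(attn_maps):
    """Mean pairwise overlap among the attention maps of committed tokens."""
    pairs = list(combinations(attn_maps, 2))
    return sum(overlap(p, q) for p, q in pairs) / len(pairs) if pairs else 0.0

def select_positions(conf, attn, k, lam=0.5):
    """Greedily pick k positions, trading confidence against visual overlap."""
    selected = []
    remaining = set(conf)
    while remaining and len(selected) < k:
        def score(pos):
            # Penalize positions whose grounding overlaps an already chosen one.
            penalty = max((overlap(attn[pos], attn[s]) for s in selected), default=0.0)
            return conf[pos] - lam * penalty
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy step: positions 0 and 1 attend to the same image patch, position 2 to another.
conf = {0: 0.90, 1: 0.85, 2: 0.80}
attn = {0: [1.0, 0.0, 0.0], 1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}

top_k = sorted(conf, key=conf.get, reverse=True)[:2]
complementary = select_positions(conf, attn, k=2)
```

In the toy step, plain top-K commits the two positions that share a patch, while the penalized selection swaps the second one for the position grounded elsewhere.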


2412.09023 2026-06-11 cs.CV (updated version)

STEAM: Squeeze and Transform Enhanced Attention Module

Rishabh Sabharwal, Ram Samarth B B, Parikshit Singh Rathore, Punit Rathore

Affiliations: Department of Electrical Engineering, IIT Bombay, India

AI summary: Proposes STEAM, a constant-parameter attention module based on a multi-head graph transformer that models channel and spatial attention together, improving CNN performance with almost no increase in computation (GFLOPs).

Abstract

Channel and spatial attention mechanisms introduced in earlier work enhance the representational capabilities of deep convolutional neural networks (CNNs) but often increase parameter and computational costs. While recent approaches focus solely on efficient feature context modeling for channel attention, we aim to model both channel and spatial attention comprehensively with minimal parameters and reduced computation. Leveraging the principles of relational modeling in graphs, we introduce a constant-parameter module, STEAM: Squeeze and Transform Enhanced Attention Module, which integrates channel and spatial attention to enhance the representation power of CNNs. To our knowledge, we are the first to propose a graph-based approach for modeling both channel and spatial attention, utilizing concepts from multi-head graph transformers. Additionally, we introduce Output Guided Pooling (OGP), which efficiently captures spatial context to further enhance spatial attention. We extensively evaluate STEAM for large-scale image classification, object detection and instance segmentation on standard benchmark datasets. STEAM achieves a 2% increase in accuracy over the standard ResNet-50 model with only a meager increase in GFLOPs. Furthermore, STEAM outperforms the leading modules, ECA and GCT, in terms of accuracy while achieving a threefold reduction in GFLOPs. The code will be made available upon acceptance.


2605.23694 2026-06-11 cs.CL (updated version)

ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

Fen Wang, Zekai Shao, Qiman Kang, Chunran Hu, Zhixuan Zhang, Lexu Xie, Chao Liu, Siming Chen

Affiliations: School of Data Science, Fudan University; Zhengzhou Zhongke Institute of Integrated Circuit and System Application; School of Computer Science, Fudan University

AI summary: Proposes the ChartFI-Bench benchmark, comprising 896 complex chart-description pairs, and designs four evaluation metrics (Faithfulness, Coverage, Informativeness, and Acuity) to systematically assess the quality of chart descriptions generated by multimodal large language models.

Abstract

Chart descriptions are essential for accessibility, cross-modal retrieval, and assisting readers in extracting insights from complex visualizations. As multimodal large language models (MLLMs) are increasingly adopted for automated chart description generation, a critical question arises: how faithfully and insightfully do these models actually describe charts? Current benchmarks fall short on two fronts: existing datasets consist of simple, homogeneous charts paired with shallow, fact-enumerating descriptions; and prevailing metrics fail to capture the multi-faceted nature of description quality. To address these gaps, we present the Chart Faithfulness and Insightfulness Benchmark (ChartFI-Bench). We first summarize four dimensions that characterize high-quality chart descriptions: factual accuracy, salient feature emphasis, domain-informed guidance, and chart-text complementarity. Guided by these dimensions, we construct a high-quality benchmark comprising 896 chart-description pairs, which feature visually complex charts and semantically rich descriptions. Furthermore, we design four aligned evaluation metrics -- Faithfulness, Coverage, Informativeness, and Acuity -- to systematically assess the quality of descriptions across these dimensions. Experiments conducted on mainstream MLLMs demonstrate the effectiveness of the proposed framework and reveal common weaknesses among existing models.


2510.04567 2026-06-11 cs.LG cs.AI (updated version)

GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang

Affiliations: Institute for Artificial Intelligence, Peking University; Wangxuan Institute of Computer Technology, Peking University

AI summary: Proposes the GILT framework, which handles node-, edge-, and graph-level classification tasks in a unified way through a token-based in-context learning mechanism, generalizing efficiently without large language models or tuning.

Comments: Accepted as an oral presentation at the GFM @ ICML 2026 Workshop

Abstract

Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving rise to the development of Graph Foundational Models (GFMs). However, current GFMs are challenged by the extreme heterogeneity of graph data, where each graph can possess a unique feature space, label set, and topology. To address this, two main paradigms have emerged. The first leverages Large Language Models (LLMs), but is fundamentally text-dependent, thus struggles to handle the numerical features in vast graphs. The second pre-trains a structure-based model, but the adaptation to new tasks typically requires a costly, per-graph tuning stage, creating a critical efficiency bottleneck. In this work, we move beyond these limitations and introduce Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs, reframing classification tasks spanning node, edge and graph levels in a unified framework. This mechanism is the key to handling heterogeneity, as it is designed to operate on generic numerical features. Further, its ability to understand class semantics dynamically from the context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines, validating the effectiveness of our approach. Our code is available at: https://github.com/yiming421/inductnode/.



ChartFI: Benchmarking Faithfulness and Insightfulness of Chart Descriptions from Multimodal Large Language Models

GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning