arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.03273 2026-06-03 cs.CV cs.AI cs.CL

VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch

Hang He, Chuhuai Yue, Chengqi Dong, Chengcheng Wan, Ting Su, Haiying Sun, Jiajun Chai, Xiaohan Wang, Guojun Yin

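Affiliations: East China Normal University; Meituan; Shanghai Innovation Institute

AI summary: Introduces the VistaHop benchmark, which uses multi-hop QA tasks to evaluate multimodal large reasoning models on iterative image inspection, visual-anchor grounding, and reasoning across evidence chains in Visual DeepSearch; experiments show that existing models perform poorly.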

English abstract

Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.
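
For readers unfamiliar with the headline metric, the following is a minimal sketch of how Pass@1 is commonly computed. It assumes the standard unbiased pass@k estimator; the function names and the (n, c) sample counts are illustrative and are not taken from VistaHop or VistaArena.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts sampled, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n attempts contains no correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over tasks; results is a list of (n, c) pairs (hypothetical data)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimator reduces to the fraction of correct attempts per task, so a benchmark-level Pass@1 is simply the mean of those fractions.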

2606.03270 2026-06-03 cs.LG cs.AI

Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles

Li Sun, Zhenhao Huang, Yiding Wang, Qin Chen, Pietro Lio, Philip S. Yu

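Affiliations: University of Science and Technology of China

AI summary: To address the lack of theory on graph structural transferability, proposes GAUGE, built on a Riemannian-geometry-based Neural Vector Bundle framework, which learns transferable substructure representations through intrinsic geometry learning; its advantages are validated on zero-shot link prediction and graph isomorphism tasks.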

Comments Accepted by ICML 2026

English abstract

Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely unexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to the intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been attempted. Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework called Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design GAUGE, a pretrainable neural architecture that constructs the vector bundle and flattens geometrically compatible local coordinates, together with a new Dirichlet loss that also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.
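
As background for the Dirichlet loss mentioned in the abstract, this sketch computes the classical Dirichlet energy of node features on a graph. Treating this textbook quantity as a reference point is an assumption: the abstract does not specify GAUGE's loss, and the function and argument names here are hypothetical.

```python
def dirichlet_energy(edges, features):
    """Classical graph Dirichlet energy: sum over edges of ||x_i - x_j||^2.

    edges: iterable of (i, j) node index pairs (each undirected edge listed once).
    features: list of equal-length feature vectors, one per node.
    """
    energy = 0.0
    for i, j in edges:
        # Squared Euclidean distance between the two endpoint features.
        energy += sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
    return energy
```

The energy is zero when connected nodes share identical features and grows as features vary across edges, which is why it is often used as a smoothness measure.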

2606.03268 2026-06-03 cs.RO

EaDex: A Cross-Embodiment Dexterous Manipulation Framework from Low-Cost Demonstrations

Qian Zhao, Xin Tong, Chengdong Wu, Yang Yang, Yingtian Li

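Affiliations: Faculty of Robot Science and Engineering, Northeastern University; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; School of Automation, Nanjing University of Information Science and Technology

AI summary: Proposes the EaDex framework, which captures human hand motion with an RGB-D camera to build structured demonstration data and combines it with a contact-reward-based dynamic demonstration annealing mechanism, enabling fast learning and training of multi-embodiment dexterous manipulation from low-cost demonstrations.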

Comments 11 pages, 5 figures, Conference: CoRL 2026, Submitted as Preprint

English abstract

Dexterous manipulation learning has long been hindered by the high costs of data and training, as pure reinforcement learning typically requires large-scale interactive exploration and imitation learning depends on high-quality demonstrations that are expensive to collect. To address this problem, we propose EaDex, a multi-embodiment dexterous manipulation learning framework under low-cost demonstration conditions, which enables rapid generation of demonstration data and consequently reduces training time for efficient dexterous manipulation. At the data level, EaDex captures human hand motions using only a single RGB-D camera and constructs structured demonstration data through MANO-based hand modeling, data normalization, and motion retargeting. At the learning level, we introduce a contact-reward-based dynamic demonstration annealing mechanism, which guides early-stage exploration under demonstration and gradually transitions to autonomous optimization with accumulating contact rewards. Using our custom dataset, we evaluate EaDex on three dexterous hands and three articulated object-opening tasks, covering nine cross-embodiment manipulation settings, achieving a 55.3% relative improvement over the baseline without demonstration annealing. These results validate the effectiveness of the proposed low-cost demonstration pipeline and the dynamic demonstration annealing strategy for dexterous manipulation learning.

2606.03265 2026-06-03 cs.RO

Wheel-Mounted/GNSS Fusion with AI-Aided Position Updates

Gal Versano, Itzik Klein

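Affiliations: Autonomous Navigation and Sensor Fusion Lab, Hatter Department of Marine Technologies, Charney School of Marine Sciences, University of Haifa

AI summary: Proposes a hybrid neural inertial navigation framework that combines wheel-mounted inertial sensors, enforced periodic trajectories, and a neural network, fusing GNSS position updates through an error-state extended Kalman filter to improve positioning accuracy by about 46%.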

English abstract

Accurate and robust localization remains a fundamental challenge for autonomous ground vehicles. In this work, we propose a hybrid neural inertial navigation framework that integrates wheel-mounted inertial sensors, enforced periodic trajectories, and a simple, efficient neural network capable of regressing vehicle displacement with GNSS position updates in an error-state extended Kalman filter. The periodic trajectories increase the inertial signal-to-noise ratio, allowing the network to use only inertial readings to estimate displacement. The approach is validated through real-world experiments using multiple wheel-mounted inertial sensors. Experimental results demonstrate that the proposed method achieves a significant improvement in positioning accuracy, reducing the position root mean squared error by approximately 46% compared to standard wheel-mounted inertial sensor fusion with GNSS updates.
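
The GNSS position update inside such a filter can be illustrated with a textbook Kalman measurement update. This is a simplified one-dimensional sketch with a [position, velocity] state, not the authors' error-state formulation; the function name, state layout, and noise values are assumptions.

```python
def gnss_position_update(x, P, z, r):
    """Kalman measurement update for state x = [position, velocity].

    P is the 2x2 state covariance (nested lists), z is a GNSS position
    measurement, and r is its variance. The measurement matrix is H = [1, 0].
    """
    innovation = z - x[0]            # measured minus predicted position
    s = P[0][0] + r                  # innovation variance H P H^T + R
    k = [P[0][0] / s, P[1][0] / s]   # Kalman gain P H^T / s
    x_new = [x[0] + k[0] * innovation, x[1] + k[1] * innovation]
    # Covariance update P_new = (I - K H) P
    P_new = [
        [(1 - k[0]) * P[0][0], (1 - k[0]) * P[0][1]],
        [P[1][0] - k[1] * P[0][0], P[1][1] - k[1] * P[0][1]],
    ]
    return x_new, P_new
```

With equal prior and measurement variance, the update moves the position estimate halfway toward the GNSS fix and halves the position variance.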

2606.03264 2026-06-03 cs.CV

PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training

Zelun Zhang, Hongen Liu, Suyin Liang, Yubo Zhang, Yiqing Xiang, Jiaxuan Liu, Ting Sun, Manhui Lin, Yue Zhang, Changda Zhou, Tingquan Gao, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

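Affiliations: PaddlePaddle Team, Baidu Inc.

AI summary: Presents PaddleOCR-VL-1.6, which uses a region-aware data optimization framework to identify and strengthen its predecessor's weak regions and combines it with a progressive post-training strategy, reaching a new state-of-the-art score of 96.33% on OmniDocBench v1.6.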

English abstract

We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision is unreliable. Rather than expanding the training corpus indiscriminately, PaddleOCR-VL-1.6 introduces a region-aware data optimization framework that identifies weak regions from the previous model, applies targeted enhancement to these regions, and improves the reliability of supervision signals. It further adopts a progressive post-training recipe based on curated data selection and reinforcement learning, pushing model performance to a higher level through staged optimization. PaddleOCR-VL-1.6 achieves a new state-of-the-art score of 96.33% on OmniDocBench v1.6, demonstrates strong competitiveness against top-tier VLMs, and provides a practical post-training recipe for the PaddleOCR-VL series.

2606.03262 2026-06-03 cs.LG cs.NA math.NA

Let There Be Light: Reflection, Refraction and Scattering for Neural Operators

Keke Wu, Yixuan Zhang, Jingrun Chen

Affiliations: Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou; School of Artificial Intelligence and Data Science, University of Science and Technology of China, Hefei; Suzhou Big Data & AI Research and Engineering Center, Suzhou; School of Mathematical Science, Peking University, Beijing; School of Mathematical Sciences, University of Science and Technology of China, Hefei

AI summary: Proposes LiNO, a neural operator inspired by light transport that decomposes latent evolution into three mechanisms (reflection, refraction, and scattering), structurally separating local feature modulation from global spatial communication, and develops an efficient scattering variant that reduces spatial complexity from quadratic to linear.

English abstract

Neural operators learn mappings between infinite-dimensional function spaces and provide a data-driven surrogate modeling paradigm for parametric partial differential equations (PDEs). Existing architectures typically obtain expressivity by parameterizing integral kernels in prescribed transform domains or by applying attention-like interactions over discretized spatial points. While these approaches have achieved substantial progress, they often face a persistent trade-off among physical interpretability, nonlocal spatial communication, mesh scalability, and computational cost. We propose a Light-inspired Neural Operator (LiNO), an operator-learning architecture whose latent evolution is decomposed into three mechanisms motivated by elementary light transport: reflection, refraction, and scattering. Reflection and refraction act as adaptive pointwise transformations in latent feature space, enabling local feature reorientation and anisotropic modulation, whereas scattering performs input-dependent nonlocal propagation over the physical domain. We first formulate scattering as a normalized pairwise kernel with relative positional bias, and then develop an efficient scattering variant that replaces explicit pairwise interactions with positive-feature global propagation and a local diffusion branch, reducing the dominant spatial complexity from quadratic to linear. This yields a structured neural operator that separates local feature modulation from global spatial communication while retaining a modular and interpretable latent evolution.
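
The quadratic-to-linear reduction described in the abstract resembles kernelized linear attention. The plain-Python sketch below compares an explicit normalized pairwise kernel with a reordered computation over positive features. The feature map `phi` (ELU plus one) and all function names are assumptions, and the relative positional bias and local diffusion branch from the abstract are omitted.

```python
import math


def phi(v):
    """Positive feature map ELU(x) + 1; an assumed choice, not taken from the paper."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_scatter(q, k, v):
    """Explicit normalized pairwise kernel: O(N^2) interactions between points."""
    out = []
    for qi in q:
        w = [dot(phi(qi), phi(kj)) for kj in k]
        total = sum(w)
        out.append([sum(wj * vj[d] for wj, vj in zip(w, v)) / total
                    for d in range(len(v[0]))])
    return out


def linear_scatter(q, k, v):
    """Same result with global summaries computed once: O(N) in the number of points."""
    fk = [phi(kj) for kj in k]
    m, dv = len(fk[0]), len(v[0])
    # kv[a][d] = sum_j phi(k_j)[a] * v_j[d];  ksum[a] = sum_j phi(k_j)[a]
    kv = [[sum(f[a] * vj[d] for f, vj in zip(fk, v)) for d in range(dv)]
          for a in range(m)]
    ksum = [sum(f[a] for f in fk) for a in range(m)]
    out = []
    for qi in q:
        fq = phi(qi)
        denom = dot(fq, ksum)
        out.append([sum(fq[a] * kv[a][d] for a in range(m)) / denom
                    for d in range(dv)])
    return out
```

Both functions compute the same weighted average; the second avoids forming the N-by-N weight matrix by summing over keys before combining with each query, which is possible because the kernel factorizes as an inner product of positive features.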

2606.03260 2026-06-03 cs.LG cs.AI

EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs

Sungwon Kim, Juho Song, Seungmin Shin, Guimok Cho, Sangkook Kim, Chanyoung Park

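Affiliations: University of Texas at Austin

AI summary: Proposes the EqGINO framework, which enforces isotropy in the spectral domain to achieve exact equivariance to discrete symmetries and to generalize to arbitrary continuous rotations, effectively modeling the coordinate-invariant physical laws of 3D PDEs.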

Comments ICML 2026

English abstract

Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)-transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. Our code is available at https://github.com/sung-won-kim/EqGINO

2606.03259 2026-06-03 cs.CL

Beyond "To whom it may concern": Tailoring Machine Translation to Audience and Intent

Raphael Merx, Ekaterina Vylomova, Trevor Cohn

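Affiliations: The University of Melbourne; Google

AI summary: Through a systematic evaluation across 50 languages, 5 model sizes, and 8 text domains, this paper studies how well large language models use explicit instructions to achieve purpose-driven machine translation, finding that instructions substantially improve translation adaptedness but that traditional metrics fail to assess adaptation quality.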

English abstract

Translation quality depends on purpose: the same source text demands different translations depending on audience, tone, and communicative intent. Yet MT models and metrics treat translation as a fixed mapping from source to target. LLMs enable users to explicitly specify purpose alongside source text, yet this capability has not been evaluated at scale. We introduce a systematic evaluation of purpose-driven MT across 50 languages, 5 model sizes and 8 text domains. We find that (1) explicit instructions substantially improve translation adaptedness, with larger gains on informal domains (conversation, social media), for larger model sizes and for higher-resource languages; (2) instructions outperform semantically-matched few-shot examples and paragraph-level context; (3) traditional MT metrics fail to capture adaptation quality, often penalizing adapted translations; (4) when curated instructions are unavailable, models can self-generate them from surrounding document context, closing up to 80% of the adaptedness gap to curated instructions. Our results establish that purpose-adapted MT is a viable and measurable capability of LLMs, while highlighting the need for purpose-aware metrics.

2606.03252 2026-06-03 cs.RO cs.AI

AirDreamer: Generalist Drone Navigation with World Models

Zian Liu, Andong Yang, Chunkai Yang, Ruidong An, Chao Gao, Guyue Zhou

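Affiliations: Institute for AI Industry Research, Tsinghua University, Beijing, China; Department of Electronic Engineering, Tsinghua University, Beijing, China; School of Remote Sensing and Information Engineering, Wuhan University, Wuhan, China

AI summary: Proposes a drone navigation framework that combines a reinforcement learning policy with world-model-based environment understanding and uses a sparse reward function to avoid local optima, achieving a 5.3% higher success rate than the best baseline in complex unseen environments and supporting sim-to-real transfer with no tuning.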

Comments 8 pages, 8 figures

English abstract

Navigating a drone in unseen and cluttered environments requires reliable generalization to novel scene layouts and an understanding of environmental structure relative to the robot's capabilities. Previous methods, which assume the same environment configuration, often rely heavily on human-designed perception pipelines and predefined rules to guide the robot toward the target. This process is environment-dependent and generalizes poorly across environments. Inspired by animal navigation behavior, we design a navigation framework that navigates with a reinforcement-learning-based policy on top of a world-model-based environment understanding to overcome these issues. In addition, a sparse reward function without hand-crafted shaping terms is designed to avoid local minima traps and encourage yaw control behaviors. In simulation and on real drones, our method exhibits emergent capabilities for navigating complex, unseen environments and escaping local optima where other methods fail. In challenging maps, it achieves a 5.3% higher navigation success rate than the best baseline. Furthermore, the proposed framework achieves effective sim-to-real transfer without any tuning during deployment. The code will be publicly available.

2606.03250 2026-06-03 cs.CL

The Word and the Way: Strategies for Domain-Specific BERT Pre-Training in German Medical NLP

Henry He, Johann Frei, Raphael Schmitt

Affiliations: School of Computation, Information and Technology; Technical University of Munich; Chair of IT Infrastructure for Translational Medical Research; Faculty of Applied Computer Science; University of Augsburg; Institute of General Practice; Faculty of Medicine and Medical Center

AI summary: Presents the ChristBERT family of models and compares three strategies (continued pre-training, training from scratch, and domain-specific vocabulary adaptation), achieving the best performance on German medical NLP tasks and setting a new state of the art.

Comments Under revision at BMC Medical Informatics and Decision Making

English abstract

Digital healthcare generates vast amounts of clinical text that can support AI-assisted applications, yet German biomedical language models remain limited by older architectures or restricted training data. We present ChristBERT (Clinical- and Healthcare-Related Issues and Subjects Tuned BERT), a family of domain-specific German RoBERTa-based language models trained on a 13.5GB corpus of scientific publications, clinical texts, health-related web content, and translated clinical resources. To investigate the impact of domain adaptation strategies in German clinical NLP, we compare continued pre-training, training from scratch, and domain-specific vocabulary adaptation. The resulting models are evaluated on three medical named entity recognition tasks and two text classification tasks. ChristBERT consistently outperforms existing general-purpose and medical German language models on four of five benchmarks and establishes a new state of the art for German clinical language modeling. Our results show that the optimal adaptation strategy is task-dependent: in our evaluation, training from scratch is particularly effective for highly specialized clinical texts, whereas continued pre-training performs well on more commonly written medical texts. All models are publicly released to support future research and applications in German medical NLP.

2606.03246 2026-06-03 cs.CV

MariData: One-Step Unpaired Image Translation for Maritime Environments

Santeri Henriksson, Mehdi Asadi, Amin Majd, Juha Kalliovaara

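Affiliations: AIS Lab, Turku University of Applied Sciences

AI summary: To address the scarcity of training data for Maritime Autonomous Surface Ships, proposes a one-step unpaired image translation framework based on CycleGAN-turbo that preserves small-object details through zero-convolution skip connections and generates realistic synthetic data under varied weather and lighting conditions.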

English abstract

The development of robust perception systems for Maritime Autonomous Surface Ships (MASS) is heavily constrained by the scarcity of diverse training data, particularly for adverse weather and low-light conditions. Because collecting paired images in dynamic maritime environments is physically impossible, synthetic data generation via unpaired image-to-image translation offers a critical solution. However, existing generative models fail to preserve the fine structural details of small navigational objects due to latent compression bottlenecks. In this paper, we introduce a framework for generating synthetic maritime data using CycleGAN-turbo, a one-step unpaired translation architecture. By incorporating zero-convolution skip connections to bypass the Variational Autoencoder (VAE) bottleneck, our approach explicitly preserves small object details (e.g., distant vessels and sea marks) during translation. We compiled a dataset of 7,000 maritime images to train and evaluate models for Day-to-Foggy, Day-to-Sunset, and Day-to-Night domain translations. Qualitative evaluations and variable-strength inference studies demonstrate that our method effectively synthesizes realistic atmospheric conditions while maintaining the underlying semantic structure of the scene. The Day-to-Foggy and Day-to-Sunset models exhibit strong structural retention, whereas the Day-to-Night model highlights the challenge of semantic hallucination, such as generating artificial coastal lights, induced by unbalanced training distributions. Ultimately, this work establishes an efficient, structure-aware data synthesis pipeline that directly addresses the data scarcity bottleneck in autonomous maritime navigation.

2606.03244 2026-06-03 cs.CL

When Does Complexity Conditioning Help a Frozen Sentence Embedding? A Controlled Study of Per-Sentence and Pair-Level Difficulty Adaptation

Suhwan Hwang

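Affiliations: Suhwan Hwang

AI summary: Uses controlled experiments to study conditioning on per-sentence and pair-level difficulty signals when a lightweight adapter follows a frozen sentence encoder, finding that a pair-level gated residual yields consistent gains on the larger and graded tasks, whereas per-sentence methods do not help.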

Comments 13 pages, 3 figures, 2 tables

English abstract

A common intuition is that sentence embeddings should adapt to the difficulty of the input. We test this intuition in a controlled, multi-seed setting: a lightweight post-encoder adapter attaches to a frozen Qwen3-Embedding-0.6B encoder, accessing only its final pooled embedding, and is evaluated on four paraphrase and semantic-similarity tasks (PAWS, MRPC, QQP, STS-B). The naive form of the idea fails: surface-based per-sentence complexity is nearly uncorrelated with frozen-baseline error (Pearson approximately 0.05) and provides no advantage over constant or shuffled controls, while degrading a saturated baseline. Even when the target is aligned to a non-circular pair-difficulty signal, the per-sentence gate still cannot reliably capture difficulty because difficulty is primarily a property of the pair, not the individual sentence. In contrast, a small pair-level residual gated by a held-out cross-encoder difficulty signal yields consistent gains on the larger and graded tasks, including +0.022 Spearman on STS-B and +0.037 on QQP, while remaining anchored to the frozen baseline across all seeds. Because this useful form operates on sentence pairs rather than individual sentences, the resulting model is best understood as a lightweight re-ranker over cached frozen embeddings, not a replacement single-vector embedding; we make no state-of-the-art claim. Our contribution is a controlled account of when difficulty-aware adaptation helps and when it fails, together with a pre-training diagnostic that predicts the available headroom.

2606.03243 2026-06-03 cs.CV

MemoGen: Can Past Experience Improve Future Text-to-Image Generation?

Wenshuo Chen, Kuimou Yu, Bowen Tian, Jianfei Song, Shaofeng Liang, Haozhe Jia, Kan Cheng, Haosen Li, Kaishen Yuan, Lei Wang, Jiemin Wu, Songning Lai, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Scholarly, Guangzhou Ziyan Technology Co., Ltd.; LimX Dynamics Technology Co., Ltd.; Shandong University; Data61/CSIRO Griffith University; Jiangsu Industrial Technology Research Institute (JITRI)

AI summary: Proposes the MemoGen framework, which uses an agentic evolution layer and reusable experience memory to improve text-to-image generation from past experience without updating the generator, surpassing proprietary systems on knowledge-intensive and reasoning benchmarks.

English abstract

Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, references, or refined prompts for the current request, yet they typically treat each generation as an isolated episode and do not systematically preserve past successes or failures for future use. In this work, we ask whether a text-to-image system can continually improve from its own generation experience without updating the underlying generator. We propose MemoGen, a training-free framework that augments existing image generators with an agentic evolution layer. For each task, MemoGen explicitly infers visual requirements, retrieves external evidence and references when necessary, translates them into executable generation constraints, evaluates the generated result, and stores task understanding, reference choices, visual feedback, successful strategies, and failure lessons as reusable experience memory. Across evolution rounds, the agent retrieves relevant experience to improve similar future generations, selectively repairing previously failed cases while preserving successful ones, thereby enabling test-time self-evolution without parameter updates. Extensive experiments on knowledge-intensive and reasoning-oriented benchmarks demonstrate the effectiveness of this paradigm: after only two evolution rounds, MemoGen built upon the open-source Qwen-Image backbone surpasses strong proprietary systems such as Nano Banana Pro and GPT-Image-1 on WISE and Mind-Bench, showing that explicit experience memory can serve as a powerful continual learning signal for reliable text-to-image generation.

2606.03241 2026-06-03 cs.CL eess.AS

Benchmarking Speech-to-Speech Translation Models

Alkis Koudounas, Hayato Futami, Quentin Jodelet, Osamu Take, Shinji Watanabe, Emiru Tsunoo

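Affiliations: Sony Group Corporation, Japan; Carnegie Mellon University, USA

AI summary: Introduces COMPASS, a unified and reproducible benchmarking framework that integrates 46 metrics for evaluating speech-to-speech translation models, reduces the metric set to 10 through correlation filtering, and verifies that domain-specific metrics correlate strongly with human judgment.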

Comments Paper under submission

English abstract

Speech-to-speech translation (S2ST) has advanced rapidly, but offline evaluation lacks a unified protocol: studies report non-overlapping metric subsets, preventing direct comparisons. We introduce COMPASS, a unified and reproducible benchmarking framework integrating 46 metrics across eight dimensions, and deploy it on 1,248 model-language configurations from FLEURS and CVSS, spanning cascaded and end-to-end architectures over ten language pairs. Architectures exhibit complementary strengths: best-vs-worst gaps exceed 30% on naturalness and speaker preservation but remain within a few points on translation quality, so single-metric rankings systematically misrepresent system quality. Correlation filtering reduces 46 metrics to 10 per direction, with three axes requiring different metrics across X→EN and EN→X (e.g., TER/UTMOS vs. ChrF++/NISQA-MOS); these subsets preserve rankings (Spearman's ρ > 0.80) while cutting evaluation time by ≈2.5×. Human validation across dubbing, podcasts, and medical domains shows standalone MOS predictors fail to predict listener preference, while top domain-specific metrics correlate with human judgment (ρ ≥ 0.90). We release COMPASS as a foundation for domain-aware S2ST evaluation.
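
One simple way to picture correlation filtering is a greedy pass that keeps a metric only if it is not strongly rank-correlated with any metric already kept. The sketch below is an assumption about the general idea, not COMPASS's actual procedure; the threshold, metric names, and scores are hypothetical, and the rank function does not handle ties.

```python
def ranks(values):
    """Rank positions (0 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, index in enumerate(order):
        r[index] = position
    return r


def spearman(x, y):
    """Spearman's rho for tie-free data via the rank-difference formula."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def filter_metrics(scores, threshold=0.9):
    """Greedily keep metrics whose |rho| with every kept metric is below threshold.

    scores maps metric name -> list of per-system scores (same system order).
    """
    kept = []
    for name, values in scores.items():
        if all(abs(spearman(values, scores[other])) < threshold for other in kept):
            kept.append(name)
    return kept
```

Because the pass is greedy, the result depends on the order in which metrics are visited; a real pipeline would need a principled ordering or a clustering step.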

2606.03240 2026-06-03 cs.RO

GeoAlign: Beyond Semantics with State-Guided Spatial Alignment in VLA Models

GeoAlign: 超越语义,VLA模型中的状态引导空间对齐

Yizhi Chen, Zhanxiang Cao, Xinyi Peng, Yixiao Zheng, Xiaxi Si, Yiheng Li, Liyun Yan, Keqi Zhu, Xueyun Chen, Shengcheng Fu, Tianyue Zhan, Yufei Jia, Jinming Yao, Yan Xie, Kun Wang, Cewu Lu, Yue Gao

发表机构 * Tongji University(同济大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Jingdezhen Ceramic University(景德镇陶瓷大学) Tsinghua University(清华大学) HONOR(荣耀) University of Science and Technology of China(中国科学技术大学)

AI总结 提出GeoAlign架构,通过RGB几何分支的后训练和机器人本体状态引导的几何特征查询,实现几何感知的空间对齐和动态可供性选择,在多个基准上取得高性能。

Comments 20 pages, 9 figures, 8 tables, including appendix

详情
AI中文摘要

当前的视觉-语言-动作(VLA)模型通常针对语义定位进行优化,而可执行的操作需要几何感知的空间对齐和动态可供性选择。我们引入了GeoAlign,一种用于VLA策略学习的状态引导空间对齐架构。GeoAlign使用机器人领域的RGB-D监督对RGB几何分支进行后训练,生成RGB衍生的几何增强后训练(GEP)特征用于策略执行。机器人的本体感知状态查询GEP特征网格,产生紧凑的、随任务阶段变化的几何令牌用于动作预测。GeoAlign在LIBERO上达到99.0%,在三个SimplerEnv-Fractal任务上达到85.3%,在八个几何关键的真实世界ALOHA任务上达到78.8%,消融实验证实了几何后训练和本体感知状态引导查询的价值。

英文摘要

Current Vision-Language-Action (VLA) models often optimize for semantic grounding, whereas executable manipulation requires geometry-aware spatial alignment and dynamic affordance selection. We introduce GeoAlign, a state-guided spatial alignment architecture for VLA policy learning. GeoAlign post-trains an RGB geometry branch with robot-domain RGB-D supervision, yielding RGB-derived Geometry-Enhanced Post-Trained (GEP) features for policy rollout. The robot's proprioceptive state queries the GEP feature grid, producing compact, phase-dependent geometry tokens for action prediction. GeoAlign achieves 99.0% on LIBERO, 85.3% across three SimplerEnv-Fractal tasks, and 78.8% on eight geometry-critical real-world ALOHA tasks, with ablations confirming the value of geometry post-training and proprioceptive-state-guided querying.

2606.03239 2026-06-03 cs.CL

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

ARBOR: 通过可复用评分标准缓冲区的在线过程奖励用于搜索智能体

Zheng Liu, Longxiang Zhang, Xintong Wang, Zhiang Xu, Shaoxiong Zhan, Xin Shan, Wen Huang, Tao Dai, Shu-Tao Xia, Chengfu Huo, Liang Ding

发表机构 * Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) Peking University(北京大学) Shenzhen University(深圳大学)

AI总结 针对基于LLM的搜索智能体训练中结果奖励缺乏过程监督的问题,提出ARBOR框架,通过维护跨查询共享的评分标准记忆库,利用对比轨迹生成局部草案并整合为通用评分标准,以稀疏成对判断提供过程级梯度,在四个多跳QA基准上优于GRPO和DAPO基线。

详情
AI中文摘要

基于LLM的搜索智能体主要使用结果奖励进行训练,搜索过程本身无监督。这种信号在结果同质组(所有采样轨迹共享相同正确性)中退化,产生零组内优势和无梯度。现有的过程监督要么训练昂贵的验证器,要么生成每个查询的评分标准,这些评分标准在查询间不一致且使用一次后丢弃。我们提出ARBOR(自适应评分标准缓冲区用于在线奖励),一种可复用的过程奖励框架,维护跨查询共享的评分标准记忆。由对比轨迹诱导的查询局部草案被接纳、整合为跨查询通用评分标准,并随策略演化而淘汰。一小部分活跃的通用评分标准通过稀疏成对判断对轨迹评分,所得分数加到基础奖励上,即使在结果奖励一致时也能提供过程级梯度。ARBOR在四个多跳QA基准上持续优于GRPO和DAPO基线,将LLM评判准确率平均提高最多4.2个百分点,并将最多42%的原本零梯度训练组转化为信息丰富的组。

英文摘要

LLM-based search agents are trained predominantly with outcome-only reward, leaving the search process itself unsupervised. This signal degenerates on outcome-homogeneous groups where all sampled trajectories share the same correctness, yielding zero within-group advantage and no gradient. Existing process supervision either trains a costly verifier or generates per-query rubrics that are inconsistent across queries and discarded after one use. We propose ARBOR (Adaptive Rubric Buffer for Online Reward), a reusable process-reward framework that maintains a rubric memory shared across queries. Query-local drafts induced from contrastive trajectories are admitted, consolidated into cross-query common rubrics, and retired as the policy evolves. A small active subset of common rubrics scores trajectories via sparse pairwise judging, and the resulting scores are added to the base reward, providing process-level gradient even when outcome reward is uniform. ARBOR consistently outperforms GRPO and DAPO baselines on four multi-hop QA benchmarks, raising average LLM-judge accuracy by up to 4.2 points and converting up to 42% of otherwise-zero-gradient training groups into informative ones.
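The degenerate case described in this abstract is easy to reproduce numerically. The sketch below uses mean-centred group advantages (standard-deviation normalisation omitted for brevity). The process scores and the 0.5 weight are hypothetical stand-ins for rubric-based scores; ARBOR's actual scoring and weighting are not given in the abstract.

```python
def group_advantages(rewards):
    """Mean-centred advantage of each trajectory within one sampled group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def shaped_rewards(outcome, process_scores, weight=0.5):
    """Add a weighted process score to the outcome reward (assumed additive form)."""
    return [o + weight * p for o, p in zip(outcome, process_scores)]


# All four sampled trajectories are correct: outcome-only reward is uniform.
outcome = [1.0, 1.0, 1.0, 1.0]
print(group_advantages(outcome))  # [0.0, 0.0, 0.0, 0.0], so no gradient signal

# Hypothetical process scores separate the trajectories again.
process = [1, 0, -1, 0]
print(group_advantages(shaped_rewards(outcome, process)))  # [0.5, 0.0, -0.5, 0.0]
```

With uniform outcome reward every advantage is zero, so the group contributes nothing to the policy gradient. Adding even a small process-level term restores a non-zero ranking within the group.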

2606.03238 2026-06-03 cs.LG cs.AI

When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming

当RLHF失败时:奖励黑客、崩溃和评估者博弈的机制分类

Zelalem Abahana

发表机构 * First Citizens Bank(第一公民银行) Alma Mater Europaea University(Alma Mater Europaea大学)

AI总结 本文通过PPO、DPO等方法的对比实验,提出了一种基于奖励和评估者分数方向的机制分类法,将RLHF失败模式分类为可定位、可预测的训练动态。

Comments 20 pages, 8 figures; includes code, artifacts, and live demo

详情
AI中文摘要

从人类反馈中强化学习(RLHF)通过用学习到的可扩展代理替代未明确指定的人类目标,实现了大规模后训练。这种替代同时创建了一个结构化的失败面:优化可以提高学习到的奖励而外部质量下降,降低代理和评估者分数,揭示代理欠对齐,或产生评估者特定的分歧。我们展示了一个紧凑RLHF流程的实证失败模式研究,该流程包括近端策略优化(PPO)、直接偏好优化(DPO)、不确定性惩罚PPO(UP-PPO)、奖励模型不确定性、近似策略漂移、多样性和重复诊断,以及两个外部LLM评估者。我们不将奖励黑客视为单一终端事件,而是使用学习到的奖励、评估者分数和平均评估者分数的方向对检查点之间的匹配转换进行分类。在61个检查点行和1920个行级转换中,激进的PPO具有最高的局部奖励黑客率(14.45%;bootstrap 95% CI: 10.16-18.75),而UP-PPO在相同激进机制下产生较低率(11.33-10.94%)。转换前的逻辑模型以ROC-AUC 0.821预测未来行级奖励黑客,行级分析发现12个设置中有3个存在检查点平均值遗漏的局部奖励黑客。核心结论是方法论上的:RLHF失败不仅是最终模型病理,而且是可分类、可定位和部分可预测的训练动态。

英文摘要

Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.
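A direction-based classification of checkpoint transitions, as described in this abstract, might look like the sketch below. The category names, the order of the rules, and the mapping from sign patterns to labels are assumptions made for illustration. The abstract names the signals used (learned reward, individual judge scores, average judge score) but not the exact decision rules.

```python
def sign(delta, eps=0.0):
    """Direction of a change: 1 (up), -1 (down), or 0 (flat within eps)."""
    if delta > eps:
        return 1
    if delta < -eps:
        return -1
    return 0


def classify_transition(d_reward, d_judge_a, d_judge_b):
    """Label one matched transition from the directions of three deltas.

    d_reward  : change in learned (proxy) reward between two checkpoints
    d_judge_a : change in the first external judge's score
    d_judge_b : change in the second external judge's score
    """
    r = sign(d_reward)
    a, b = sign(d_judge_a), sign(d_judge_b)
    if a * b < 0:
        return "evaluator_disagreement"   # judges move in opposite directions
    avg = sign((d_judge_a + d_judge_b) / 2)
    if r > 0 and avg < 0:
        return "reward_hacking"           # proxy up, external quality down
    if r < 0 and avg < 0:
        return "collapse"                 # proxy and judges both degrade
    if r < 0 and avg > 0:
        return "proxy_under_alignment"    # judges improve, proxy misses it
    return "aligned_or_flat"


print(classify_transition(0.3, -0.2, -0.1))  # reward_hacking
```

Applying such a rule per row rather than per checkpoint average is what allows localized failures to surface even when the averages look healthy.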

2606.03237 2026-06-03 cs.AI cs.CL cs.CY cs.LG cs.MA

Solipsistic Superintelligence is Unlikely to be Cooperative

唯我论超级智能不太可能合作

Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo

发表机构 * DeepMind(DeepMind公司) University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文指出,基于唯我论方法设计的超级智能(极端能力的任务求解器)因忽视部署引发的内生非平稳性而难以合作,呼吁将相互依存作为核心设计原则的非唯我论研究范式。

Comments 24 pages, 1 figure, Accepted at Proceedings of the 43rd International Conference on Machine Learning, 2026

详情
AI中文摘要

AI的核心挑战正从能力转向共存。AI研究的主导范式侧重于开发将世界视为外生且平稳反馈源的强大智能体。我们认为,源于这种唯我论AI设计方法的超级智能(极端能力的任务求解器)不太可能合作。部署AI系统会引发内生非平稳性,导致训练-测试-部署差距,即历史分布与部署环境相偏离。我们称此为单边优化的自我削弱属性。缩小这一差距需要参与合作的AI:即多个行为体导航其相互依存的均衡选择过程。我们呼吁一种非唯我论的研究范式,将这种相互依存作为核心设计原则,而非将合作视为待解决的任务。这需要构建涉及自适应对手方的动态评估测试平台,将制度视为设计原语,并保留人类能动性作为我们构建系统的结构性特征。

英文摘要

AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.

2606.03236 2026-06-03 cs.AI

Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

先感知后推理:一种用于高效可靠主动移动代理的预推理感知框架

Zhijie Ding, Weinan Hong, Zicheng Zhu, Lei Li, Dezhi Kong, Hao Wang, Peng Zhou, Xuchu Jiang, Jiaming Xu

发表机构 * HyperAI Team, Xiaomi Corporation(HyperAI团队,小米公司) Zhongnan University of Economics and Law(中南财经政法大学) Jilin University(吉林大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出预推理感知框架(PRPF),通过轻量级多模态主动感知器(MPP)进行干预门控和上下文压缩,仅在需要时激活主动代理推理器(PAR),以解决主动移动代理中干预时机与方式决策的目标错位和冗余推理问题。

详情
AI中文摘要

多模态大语言模型(MLLMs)显著推动了移动代理的发展,但主动移动辅助仍然具有挑战性,因为代理必须在决定如何协助之前确定何时干预。现有系统通常在一个统一的基于MLLM的流水线中实现这两个决策,导致保守的干预过滤与全面的辅助生成之间的目标错位,以及在代理应保持沉默时的冗余推理。为了解决这些限制,我们提出了预推理感知框架(PRPF),这是一个基于先感知后推理的两阶段框架。PRPF引入了一个轻量级的多模态主动感知器(MPP)用于干预门控和上下文压缩,并仅在需要干预时激活主动代理推理器(PAR)。在ProactiveMobile基准上的实验表明,与ProactiveMobile基线相比,PRPF显著降低了误触发率(FTR),同时提高了成功率(SR)和推理效率。

英文摘要

Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.

2606.03234 2026-06-03 cs.LG

Right Makes Might: Aligning Verified Hidden States Empowers RL Reasoning

正确即力量:对齐验证的隐藏状态增强强化学习推理

Ziyue Wang, Aomufei Yuan, Yongfu Zhu, Shuai Dong, Wenpu Liu, Yiran Yao, Weichu Xie, Yuqi Xu, Caoyuan Ma, Wenqi Shao, Xiaoying Zhang, Nan Duan, Jiaqi Wang

发表机构 * Peking University(北京大学) JINGDONG(京东) Shanghai Innovation Institute(上海创新研究院) The University of Tokyo(东京大学) Tianjin University(天津大学)

AI总结 提出Hidden-Align辅助损失函数,在强化学习训练中对齐正确rollout在锚点token处的最后一层隐藏状态,提升数学推理性能。

Comments 16 pages, 7 figures

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为提升大语言模型数学推理的主流方法,但当前方法将每个正确rollout简化为单个奖励比特,忽略了其隐藏状态共享的几何结构。研究这一结构发现,在锚点token(答案标记前的位置)处,正确rollout自然收敛,因为它们必须产生相同答案(余弦相似度约0.84),但每个rollout仍保留其独特推理路径的残余方差。鼓励在该点完全对齐,推动模型提取统一的“正确决策”表示,减少对推理路径的敏感性。基于此观察,我们提出Hidden-Align,一种辅助损失函数,在RL训练中对齐正确rollout在锚点token处的最后一层隐藏状态,训练和推理中零开销。在八个数学推理基准上,Hidden-Align在Qwen3-1.7B、4B和14B上分别比DAPO基线平均提升pass@1 3.8、6.2和5.4个百分点,且在所有三种规模上pass@k一致提升,消融实验支持了损失类型、锚点位置、层深度和损失权重的影响。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has become the dominant approach for improving mathematical reasoning in large language models, yet current methods reduce each correct rollout to a single reward bit, ignoring the geometric structure shared among their hidden states. Investigating this structure, we find that at the anchor token (the position immediately before the answer marker), correct rollouts converge naturally because they must produce the same answer (cosine similarity ~0.84), yet each retains residual variance from its unique reasoning path. Encouraging full alignment at this point pushes the model to extract a unified "correct decision" representation, reducing sensitivity to which reasoning path was taken. Based on this observation, we propose Hidden-Align, an auxiliary loss function that aligns the last-layer hidden states of correct rollouts at the anchor token during RL training, with zero overhead in both training and inference. On eight mathematical reasoning benchmarks, Hidden-Align improves average pass@1 over the DAPO baseline by 3.8, 6.2, and 5.4 percentage points on Qwen3-1.7B, 4B, and 14B respectively, with consistent pass@k gains across all three scales, supported by ablations on loss type, anchor position, layer depth, and loss weight.
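To make the alignment idea concrete, the sketch below computes one possible auxiliary loss: the mean of (1 - cosine similarity) over all pairs of anchor-token hidden states from correct rollouts. This is an assumed form. The abstract mentions ablations over loss type, so the loss actually used by Hidden-Align may differ, and the toy vectors are not real hidden states.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(anchor_states, correct):
    """Mean (1 - cosine) over pairs of correct rollouts; 0.0 if fewer than two.

    anchor_states : one hidden-state vector per rollout, taken at the anchor token
    correct       : one boolean per rollout from the verifier
    """
    kept = [h for h, ok in zip(anchor_states, correct) if ok]
    pairs = list(combinations(kept, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d "hidden states" for three rollouts.
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(alignment_loss(states, [True, True, False]))  # 0.0: the two correct rollouts already agree
print(alignment_loss(states, [True, True, True]))   # about 0.667: one rollout points elsewhere
```

Incorrect rollouts are excluded before pairing, so the term only pulls together representations that led to the verified answer.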

2606.03232 2026-06-03 cs.LG cs.AI

GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond

GFFMERGE: 图神经力场的高效合并及其扩展

Parth Verma, Parv P. Singh, Vipul Garg, Ishita Thakre, N. M. Anoop Krishnan, Sayan Ranu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Cambridge(剑桥大学)

AI总结 提出GFFMERGE框架,通过凸嵌入对齐问题解析解实现图神经网络的闭式模型合并,在力场回归任务中恢复接近联合训练的性能,并实现5-27倍加速。

详情
AI中文摘要

图神经网络(GNN)通过降低计算成本实现接近量子精度的原子模拟,彻底改变了神经力场,但将这些模型适应新化学系统需要对基础模型进行昂贵的重新训练。受视觉和语言处理中模型合并的启发,我们提出了GFFMERGE,这是第一个用于GNN闭式模型合并的原则性框架。我们利用消息传递层的线性结构,将合并问题形式化为具有解析解的凸嵌入对齐问题。通过对GNN模型合并的首次系统基准测试,我们发现为视觉和语言设计的现有方法在力场回归任务上灾难性地失败,而GFFMERGE恢复了接近黄金标准联合训练的性能。在分子(MD17、MD22)、固态(LiPS20)和大规模图基准测试中,GFFMERGE及其通用GNN对应物GNNMERGE实现了5-27倍的加速,同时支持专业模型的模块化组合。值得注意的是,我们的闭式解在微调前就优于所有基线方法,并为更快、数据高效的收敛提供了优越的初始化。

英文摘要

Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27× speedups while enabling modular composition of specialized models. Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.

2606.03227 2026-06-03 cs.LG

Learning Temporal Causal Structure via Smooth Differentiable Optimization

通过平滑可微优化学习时间因果结构

Tong Zhao, Ce Guo, Wayne Luk, Emil Lupu, Ray Dipojjwal

发表机构 * Imperial College London(帝国理工学院) University of Bristol(布里斯托大学)

AI总结 提出使用Gumbel-Sinkhorn算子学习可微变量排序,三角化结构向量自回归模型的瞬时系数矩阵,将无环性转化为参数化,实现统一连续优化,提高时间序列因果发现的效率和准确性。

详情
AI中文摘要

多变量时间序列中具有瞬时效应的因果发现具有挑战性,因为瞬时结构必须是无环的。先前的方法通过将瞬时和滞后估计分离为多阶段流水线,或通过复杂的增广拉格朗日优化施加代数无环性约束来强制执行这一点,这两种方法都会带来高昂的计算成本。在这项工作中,我们提出了一种不同的方法:我们使用Gumbel-Sinkhorn算子学习变量的可微排列,并按照学习到的顺序三角化结构向量自回归(SVAR)模型的瞬时系数矩阵。这将无环性从硬约束转化为参数化,并在整个优化过程中保持其有效性。通过这样做,我们的方法实现了基于梯度的学习的统一连续优化,从而提高了时间序列因果发现的效率。在三个真实世界基准测试中,我们的方法在发现准确性和效率方面均优于12个基线方法,取得了最佳整体性能。在大规模基准测试中,它进一步展示了强大的可扩展性,实现了比竞争方法快6倍以上的加速。

英文摘要

Causal discovery with instantaneous effects in multivariate time series is challenging, as the instantaneous structure must be acyclic. Prior methods enforce this by either separating instantaneous and lagged estimation into multi-stage pipelines or imposing algebraic acyclicity constraints via complex augmented Lagrangian optimization, both of which incur high computational cost. In this work, we propose a different approach: we learn a differentiable permutation of variables using the Gumbel-Sinkhorn operator and triangularize the instantaneous coefficient matrix of a Structural Vector Autoregressive (SVAR) model in the learned order. This converts acyclicity from a hard constraint into a parameterization and keeps it valid throughout optimization. In doing so, our method enables unified, continuous optimization with gradient-based learning, leading to improved efficiency in time-series causal discovery. Across three real-world benchmarks, our method achieves the best overall performance compared with 12 baselines in both discovery accuracy and efficiency. On the large-scale benchmark, it further demonstrates strong scalability, achieving more than a 6x speedup over competing methods.
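The reason a variable ordering turns acyclicity into a parameterization can be shown with a small sketch. It uses a fixed, hard ordering in place of the learned Gumbel-Sinkhorn relaxation, so it illustrates only the triangularization step. The function names and the toy matrix are hypothetical.

```python
def mask_by_order(weights, order):
    """Keep weights[i][j] (edge i -> j) only if i comes before j in `order`."""
    position = {var: idx for idx, var in enumerate(order)}
    n = len(weights)
    return [
        [weights[i][j] if position[i] < position[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def is_acyclic(weights):
    """Kahn's algorithm on the graph whose edges are the non-zero entries."""
    n = len(weights)
    indegree = [sum(1 for i in range(n) if weights[i][j] != 0) for j in range(n)]
    ready = [j for j in range(n) if indegree[j] == 0]
    visited = 0
    while ready:
        i = ready.pop()
        visited += 1
        for j in range(n):
            if weights[i][j] != 0:
                indegree[j] -= 1
                if indegree[j] == 0:
                    ready.append(j)
    return visited == n


# A dense instantaneous coefficient matrix: every pair is connected both ways.
dense = [
    [0.0, 0.5, 0.2],
    [0.3, 0.0, 0.7],
    [0.4, 0.1, 0.0],
]
print(is_acyclic(dense))  # False

# Under the ordering 2 -> 0 -> 1 only "forward" edges survive.
masked = mask_by_order(dense, [2, 0, 1])
print(masked)              # [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0], [0.4, 0.1, 0.0]]
print(is_acyclic(masked))  # True
```

Whatever values the surviving entries take during optimization, the masked matrix stays acyclic, so no separate acyclicity penalty is needed once an order is fixed.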

2606.03223 2026-06-03 cs.RO cs.AI

BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions

BotDirector:跨对称现实的多模态交互机器人讲故事

Zhe Sun, Meng Wang, Lei Wang, Yuxi Wang, Wanxin Li, Yujia Peng, Zhenliang Zhang

发表机构 * State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China(通用人工智能全国重点实验室,北京通用人工智能研究院,北京,中国) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 提出一个结合具身交互和自然语言交互的机器人讲故事系统,利用LLM代理将儿童创建的叙事转化为自导航群体机器人的运动序列,支持灵活场景和日常物品。

详情
Journal ref
2026 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)
AI中文摘要

机器人讲故事融合了技术创新和创意表达,以前所未有的方式吸引儿童。然而,技术方面往往对儿童来说过于复杂。我们提出了一个交互式系统,通过具身和自然语言交互促进机器人讲故事。儿童用自己的物品布置游乐场,并与LLM代理一起创建叙事。创建的叙事基于地图和角色转化为运动序列,并由自导航群体机器人执行。该系统增强了机器人讲故事的灵活性,使幼儿能够用日常物品创作机器人戏剧。

英文摘要

Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self-navigating swarm robots. This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.

2606.03220 2026-06-03 cs.CL cs.AI

WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts

WebRISE: 面向MLLM生成Web工件的需求诱导状态评估

Yuxin Meng, Yuhan Suo, Junjie Wang, Yuhan Sun, Yiyao Yu, Ruixu Zhang, Ruining Hu, Yubin Wang, Shouwei Ruan, Bin Wang, Yuxiang Zhang, Yujiu Yang

发表机构 * Tsinghua University(清华大学) Huawei Noah’s Ark Lab(华为诺亚实验室) East China Normal University(华东师范大学) Tongji University(同济大学) Institute of Artificial Intelligence, Beihang University(北京航空航天大学人工智能研究院)

AI总结 提出WebRISE框架,通过交互契约图(ICG)将任务需求转化为可观察状态、用户意图转换和DOM/视觉断言,以评估MLLM生成的Web工件的功能正确性,实验表明ICG评分检测状态错误率是检查点评估的2-16倍。

详情
AI中文摘要

现有的MLLM生成Web工件基准通过局部证据评估交互,忽略了决定页面是否正常工作的需求诱导状态和转换。我们提出WebRISE,它将任务需求编译成交互契约图(ICG),包含可观察状态、用户意图转换以及DOM/视觉断言,以实现与实现无关的浏览器执行。WebRISE涵盖五种输入模态(文本、Markdown、草图、图像、视频)下的442个任务,包含5,495个转换和5,271个需求检查,将用户声明的功能与隐式的产品级约束分开。在14个MLLM中,即使最强的模型也仅达到65.6%的转换有效性和66.3%的需求覆盖率,且视觉质量不能代表行为(Qwen3.6-35B-A3B在Markdown上:V=80.8但T=15.5)。视频提供了最强的交互信号(隐式覆盖率比文本高10.6个百分点),而隐式约束仍然存在;缺陷注入表明,基于ICG的评分检测状态错误的速率是检查点评估的2-16倍。

英文摘要

Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.

2606.03219 2026-06-03 cs.CL cs.LG

Sample-Size Scaling of the African Languages NLI Evaluation

非洲语言自然语言推理评估的样本量缩放

Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Nwokocha

发表机构 * Noida Institute of Engineering and Technology(诺伊达工程技术学院) ML Collective(机器学习集体)

AI总结 本研究通过AfriXNLI基准对16种非洲语言进行系统样本量缩放实验,发现NLI性能随样本量增加并非单调提升,而是呈现语言敏感且非单调的缩放行为,表明数据量不足以保证稳定收益,需语言敏感的数据集和更强多语言建模策略。

Comments Accepted at the AfricaNLP Workshop, EACL 2026

详情
AI中文摘要

非洲语言标注数据非常少,且增加标注数据量是否能可靠提升下游性能尚不明确。本研究基于AfriXNLI基准,对16种非洲语言进行了自然语言推理(NLI)的系统样本量缩放研究。在受控条件下,测试了两个约0.6B参数的多语言Transformer模型(在XNLI上微调的XLM-R Large和AfroXLM-R Large),样本量从50到500个标注示例不等,并在随机子采样运行中平均结果。与通常认为的随数据增加性能单调提升相反,我们发现了一种强烈语言敏感且通常非单调的缩放行为。一些语言在低资源场景下表现出早期饱和或性能下降,以及高方差。这些结果表明,数据量不足以保证非洲NLI的稳定收益,因此需要创建语言敏感的数据集和更强的多语言建模策略。

英文摘要

African languages have very little labelled data, and it is unclear whether increasing the amount of annotated data reliably improves downstream performance. This study is a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters (XLM-R Large fine-tuned on XNLI, and AfroXLM-R Large) are tested on sample sizes between 50 and 500 labelled examples, with results averaged across random subsampling runs. Contrary to the usual assumption that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, motivating language-sensitive dataset creation and stronger multilingual modelling strategies.

2606.03216 2026-06-03 cs.CV

Follow-Your-Preference++: Rethinking Preference Alignment for Image Inpainting

Follow-Your-Preference++:重新思考图像修复中的偏好对齐

Junkun Yuan, Yutao Shen, Toru Aonishi, Hideki Nakayama, Yue Ma

发表机构 * Zhejiang University(浙江大学) The University of Tokyo(东京大学) Tsinghua University(清华大学)

AI总结 本文从基本原理出发,通过直接偏好优化框架和公开奖励模型构建偏好数据,系统研究了图像修复中的偏好对齐问题,发现奖励模型存在偏差但可通过集成缓解,并在标准指标、大视觉语言模型评估和人类评估上显著超越先前最先进模型。

Comments 23 pages, 14 figures. arXiv admin note: substantial text overlap with arXiv:2509.23082

详情
AI中文摘要

我们研究图像修复中的偏好对齐。与其提出另一种方法,我们从头重新审视该问题并重新评估其核心挑战。我们采用广泛使用的直接偏好优化框架,并利用公开的奖励模型构建偏好训练数据。我们的实证研究涵盖九个奖励模型、两个基准以及两个在架构和生成机制上不同的基线修复模型。我们的主要发现是:(1) 大多数奖励模型为偏好数据构建提供了有效信号,尽管有些作为评估者不可靠。(2) 跨模型和基准,偏好数据在候选和样本缩放下表现出一致的趋势。(3) 奖励模型显示出明显的偏差(特别是在亮度、构图和配色方案方面),使其容易引发奖励黑客行为。(4) 简单的奖励模型集成减轻了此类偏差,并产生了稳健且可泛化的性能。(5) 偏好对齐可迁移到对象移除任务,其中目标从开放式创意生成转变为连贯的背景补全。(6) 进一步分析表明,校准的集成方法进一步减轻了黑客行为并提高了鲁棒性。在不修改模型架构或引入额外数据集的情况下,我们的模型在标准指标、大视觉语言模型评估和人类评估上显著优于先前最先进的模型。我们的代码可在以下网址获取:https://github.com/shenytzzz/Follow-Your-Preference。

英文摘要

We study preference alignment for image inpainting. Rather than proposing yet another method, we revisit the problem from first principles and reassess its core challenges. We adopt the widely used direct preference optimization framework and construct preference training data with publicly available reward models. Our empirical study spans nine reward models, two benchmarks, and two baseline inpainting models that differ in architecture and generative mechanism. Our main findings are: (1) Most reward models provide valid signals for preference data construction, although some are unreliable as evaluators. (2) Across models and benchmarks, preference data exhibits consistent trends under both candidate and sample scaling. (3) Reward models display pronounced biases, particularly in brightness, composition, and color scheme, that make them prone to inducing reward hacking. (4) A simple ensemble of reward models mitigates such biases and yields robust, generalizable performance. (5) Preference alignment is transferable to the object removal task, where the goal shifts from open-ended creative generation to coherent background completion. (6) Further analysis reveals that a calibrated ensemble method further mitigates hacking and improves robustness. Without modifying model architectures or introducing additional datasets, our models substantially outperform prior state-of-the-art models on standard metrics, large vision-language model evaluations, and human assessments. Our code is available at: https://github.com/shenytzzz/Follow-Your-Preference.

2606.03214 2026-06-03 cs.AI cs.CV cs.CY cs.LG

Effect of Demographic Bias on Skin Lesion Classification

人口统计偏差对皮肤病变分类的影响

Ralf Raumanns, Gerard Schouten, Veronika Cheplygina, Josien P. W. Pluim

发表机构 * Fontys University of Applied Science, Venlo, The Netherlands(Fontys应用科学大学,荷兰芬洛) Fontys University of Applied Science, Eindhoven, The Netherlands(Fontys应用科学大学,荷兰埃因霍温) Eindhoven University of Technology, Eindhoven, The Netherlands(埃因霍温理工大学,荷兰埃因霍温) IT University of Copenhagen, Denmark(哥本哈根信息技术大学,丹麦)

AI总结 本研究使用基于ResNet的卷积模型评估皮肤病变分类性能,通过线性规划控制人口统计特征,研究患者性别和年龄偏差的影响,并比较三种学习策略,发现性别偏差主要源于数据不平衡,而年龄偏差始终偏向年轻群体。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA), 26 pages, 12 figures

详情
Journal ref
https://melba-journal.org/2026:011
AI中文摘要

在这项研究中,我们评估了使用基于ResNet的卷积模型进行皮肤病变分类的性能,重点关注训练数据中人口统计偏差的影响,特别是患者性别和年龄的变化。我们使用线性规划生成具有受控人口统计特征的数据集,从而系统性地研究偏差效应。评估了三种学习策略:单任务模型、强化多任务模型和对抗学习方案。我们的性别分析表明,性别特定的训练数据集优化了模型性能。值得注意的是,在训练数据中包含男性患者提高了男性亚组的性能,即使在女性占多数的情况下也是如此。强化学习和对抗学习方案缩小或消除了平衡和女性占多数数据集中的偏差差距。然而,这些策略在男性占多数的环境中效果较差,模型在男性上的表现仍然优于女性。在主要男性患者群体中,与基线模型相比,这两种学习方案显示出边际偏差减少。基于年龄的分析表明,三种模型方法的基线性能相当,性能随年龄类别下降。无论训练数据分布如何,年轻组始终达到最高性能。尽管平衡训练对最年轻年龄组产生最佳结果,但较老年组的性能下降。我们发现性别偏差主要源于数据不平衡,而年龄偏差无论分布如何始终偏向年轻群体。这些不同的机制需要有针对性的缓解策略。此外,在两个外部数据集上的跨数据集验证表明,域转移显著影响性能和人口统计偏差模式。

英文摘要

In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.

2606.03209 2026-06-03 cs.LG

DECA: Decentralizing Block-Wise Adam for Efficient LLM Full-Parameter Fine-Tuning on Non-IID Data

DECA: 去中心化逐块Adam优化器用于非独立同分布数据上的高效大语言模型全参数微调

Yunsheng Yuan, Shaowei Li, Kai Wang, Zhongyuan Sun, Zheng Zhang, Kai Han, Jun Luo, Feng Li

发表机构 * School of Computer Science and Technology, Shandong University, Qingdao China(山东大学计算机科学与技术学院,青岛中国) School of Mathematical Science, Peking University, China(北京大学数学科学学院,中国) IEIT SYSTEM, China(IEIT SYSTEM,中国) School of Computer Science and Artificial Intelligence, Shanghai University of Finance and Economics, Shanghai, China(上海财经大学计算机科学与人工智能学院,上海中国) College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡)

AI总结 针对隐私敏感和资源受限环境中的大语言模型微调,提出DECA框架,通过逐块Adam优化和去中心化共识机制,在非独立同分布数据上实现高效的全参数微调,兼顾收敛速度、下游性能和资源效率。

详情
AI中文摘要

在隐私敏感和资源受限的环境中微调大语言模型(LLM)仍然具有挑战性。由于训练数据通常分布在多个客户端上,去中心化微调提供了一种无需中央服务器的协作适应自然范式。然而,在这种去中心化设置中实现全参数微调(FPFT)是困难的:FPFT提供了强大的适应能力,但对于十亿级模型来说会带来高昂的资源消耗。因此,现有的去中心化LLM微调方法主要依赖于参数高效更新,这提高了效率但可能限制下游性能。此外,客户端数据通常是非独立同分布的,这使得去中心化优化更容易受到客户端漂移和不稳定收敛的影响。为了解决这些挑战,我们提出了DECA,一种用于非独立同分布数据上LLM的资源高效去中心化FPFT框架。DECA将模型参数划分为不相交的块,并执行顺序逐块Adam优化,在保持去中心化全参数适应的同时减少资源消耗。为了稳定训练,DECA进一步引入了基于新鲜局部梯度统计和共识衍生差异信号的一阶和二阶逐块矩估计。我们提供了严格的理论分析和广泛的实验,表明DECA实现了快速收敛、强大的下游性能和显著的资源效率。

英文摘要

Fine-tuning large language models (LLMs) in privacy-sensitive and resource-constrained environments remains challenging. Since training data are often distributed across multiple clients, decentralized fine-tuning offers a natural paradigm for collaborative adaptation without a central server. However, enabling full-parameter fine-tuning (FPFT) in this decentralized setting is difficult: FPFT provides strong adaptation capacity but incurs prohibitive resource consumption for billion-scale models. Existing decentralized LLM fine-tuning methods therefore mainly rely on parameter-efficient updates, which improve efficiency but may restrict downstream performance. Moreover, client data are typically non-IID, making decentralized optimization more vulnerable to client drift and unstable convergence. To address these challenges, we propose DECA, a resource-efficient decentralized FPFT framework for LLMs on non-IID data. DECA partitions model parameters into disjoint blocks and performs sequential block-wise Adam optimization, reducing resource consumption while preserving decentralized full-parameter adaptation. To stabilize training, DECA further introduces first- and second-order block-wise moment estimates with fresh local gradient statistics and consensus-derived discrepancy signals. We provide rigorous theoretical analysis and extensive experiments, showing that DECA achieves fast convergence, strong downstream performance, and significant resource efficiency.
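The sequential block-wise update at the core of this abstract can be sketched on a toy objective. The example covers a single client only: the decentralized consensus step and DECA's modified moment estimates are not reproduced, and the learning rate, block layout, and objective `sum(p**2)` are illustrative assumptions.

```python
import math


def adam_block_step(params, grads, m, v, block, step,
                    lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one bias-corrected Adam update to the indices in `block` only."""
    for i in block:
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2
        m_hat = m[i] / (1 - beta1 ** step)
        v_hat = v[i] / (1 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


def sweep(params, blocks, steps, m, v):
    """Visit each block once, in order, recomputing the toy gradient each time."""
    for b, block in enumerate(blocks):
        steps[b] += 1
        grads = [2.0 * p for p in params]  # gradient of sum(p**2)
        adam_block_step(params, grads, m, v, block, steps[b])


params = [1.0, 1.0, 1.0, 1.0]
blocks = [[0, 1], [2, 3]]          # disjoint parameter blocks
m, v, steps = [0.0] * 4, [0.0] * 4, [0, 0]
sweep(params, blocks, steps, m, v)
print(params)  # every entry is close to 0.9 after one sweep
```

In a real block-wise setup only the active block would need gradients and optimizer state at any moment, which is where the memory saving comes from. This toy version keeps full-size state lists purely for readability.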

2606.03204 2026-06-03 cs.RO eess.SP

Toward Gripper-Integrated Active Electrosense for Pre-Contact Sensing in Underwater Soft Grippers

面向水下软体夹爪预接触感知的夹爪集成主动电感知

Ahsan Tanveer, Muhammad Hamza, Waqar Hussain Afridi, Chen Wang, Guangming Xie

发表机构 * Intelligent Biomimetic Design Lab, School of Advanced Manufacturing and Robotics, State Key Laboratory for Turbulence and Complex Systems, College of Engineering, Peking University(智能仿生设计实验室,先进制造与机器人学院,湍流与复杂系统国家重点实验室,北京大学) National Engineering Research Center of Software Engineering, Peking University(软件工程国家工程研究中心,北京大学) Institute of Ocean Research, Peking University(海洋研究所,北京大学)

AI总结 针对水下视觉受限问题,提出一种集成于软体夹爪的主动电感知方法,通过测量导电介质中电场扰动实现预接触信号检测,实验表明多电极电压读数可检测物体引起的结构化变化。

Comments Extended abstract accepted to the IEEE ICRA 2026 Workshop on Manipulation Robustness

详情
AI中文摘要

水下操作通常发生在因浑浊、眩光和夹爪遮挡导致能见度降低的环境中,这限制了接近和抓取过程中基于视觉感知的可靠性。在这种情况下,软体夹爪非常适合顺应性交互,但通常缺乏在视觉不可靠时指导接近和闭合的机载预接触线索。本扩展摘要探索了主动电感知作为一种轻量级传感模式,通过测量导电介质中施加电场的扰动,在接触前提供类似接近的信号。我们为仿章鱼夹爪设计了离散电极布局,并使用现成硬件记录多通道传感电压。使用悬浮导电球进行的模拟和水槽实验显示,相对于空水基线,多电极电压读数出现了结构化的、依赖于物体的变化,且可检测性随5至20 V的激励和1 mHz至1 kHz的频率而变化。这些发现促使系统研究集成于夹爪的电感知作为水下软体操作补充预接触线索的可行性。

英文摘要

Underwater manipulation often occurs under degraded visibility due to turbidity, glare, and gripper occlusion, limiting the reliability of vision-based perception during approach and grasping. In such settings, soft grippers are well suited for compliant interaction, but they typically lack an onboard pre-contact cue that can guide approach and closure when vision is unreliable. This extended abstract explores active electrosense as a lightweight sensing modality that can provide a proximity-like signal prior to contact by measuring perturbations of an applied electric field in conductive media. We instrument an octopus-inspired gripper with a discrete electrode layout and record multi-channel sensing voltages using off-the-shelf hardware. Simulation and tank experiments with a suspended conductive sphere show structured, object-dependent changes in the multi-electrode voltage readout relative to empty-water baselines, with detectability varying across excitation of 5 to 20 V and frequencies from 1 mHz to 1 kHz. These findings motivate systematic investigation of gripper-integrated electrosense as a complementary pre-contact cue for underwater soft manipulation.

2606.03203 2026-06-03 cs.AI

MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

Affiliations * Microsoft Research Asia; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University; Shanghai Key Laboratory of MICCAI

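AI Summary Introduces MedCUA-Bench, an interactive benchmark covering 18 clinical scenarios across 10 medical domains; a deterministic checker evaluates task completion and five clinical safety dimensions, revealing the performance gap of current agents on real clinical software.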

Details
AI Chinese Abstract (translated)

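Computer-use agents can automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces has not been adequately validated. Existing benchmarks focus on general web or desktop tasks and give little coverage to medical software, which requires domain knowledge, has user-interface designs that differ markedly from mainstream applications, lacks public test environments, and needs safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture realistic clinical interfaces while avoiding licensing and privacy restrictions. Each task comes with paired intent-level and step-level goals to separate clinical reasoning from user-interface execution, and is evaluated by a deterministic checker on task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches a strict success rate of 54.2%, while all models stay below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use and provides a reproducible testbed for future research.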

English Abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.