arXivDaily: arXiv daily academic digest, updated Monday through Friday

2506.02568 2026-06-11 cs.AI (updated version)

MLaGA: Multimodal Large Language and Graph Assistant

Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

Affiliations: New York University; New York University Shanghai; New York University Brooklyn; Virginia Polytechnic Institute and State University; New York University Abu Dhabi

AI summary: Proposes MLaGA, which extends large language models to multimodal graph data through a structure-aware multimodal encoder and instruction tuning, outperforming baseline methods on supervised and transfer learning tasks.

Abstract

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions. However, their applications to multimodal graphs--where nodes are associated with diverse attribute types, such as texts and images--remain underexplored, despite their ubiquity in real-world scenarios. To bridge the gap, we introduce the Multimodal Large Language and Graph Assistant (MLaGA), an innovative model that adeptly extends LLM capabilities to facilitate reasoning over complex graph structures and multimodal attributes. We first design a structure-aware multimodal encoder to align textual and visual attributes within a unified space through a joint graph pre-training objective. Subsequently, we implement a multimodal instruction-tuning approach to seamlessly integrate multimodal features and graph structures into the LLM through lightweight projectors. Extensive experiments across multiple datasets demonstrate the effectiveness of MLaGA compared to leading baseline methods, achieving superior performance in diverse graph learning tasks under both supervised and transfer learning scenarios.

2506.01396 2026-06-11 cs.LG cs.CR stat.ML (updated version)

Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela

Affiliations: Department of Computer Science, University of Helsinki; Department of Electrical and Computer Engineering, American University of Beirut

AI summary: Targets the unfair impact of gradient clipping on minority groups in differentially private learning. Proposes bounded adaptive clipping, which adds a tunable lower bound to prevent excessive gradient suppression and improves worst-class accuracy by more than 10 percentage points on Skewed and Fashion MNIST.

Comments: TMLR camera-ready version

Abstract

Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at https://github.com/TrustworthyMLHelsinki/adaptive-clipping-fairness.
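
The bounded adaptive clipping idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the quantile-style geometric update, the function names, and the defaults (`target_quantile`, `lr`, `lower_bound`) are assumptions made for the example, and the noise a real DP mechanism adds to both the clipped gradients and the quantile estimate is left out.

```python
import math


def clip_gradient(grad, bound):
    """Scale a per-sample gradient so its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def update_bound(bound, grad_norms, target_quantile=0.5, lr=0.2, lower_bound=0.1):
    """One adaptive step for the clipping bound, floored at `lower_bound`.

    Assumed update rule (hypothetical, for illustration only): shrink the
    bound when more than `target_quantile` of the gradient norms already fit
    under it, and grow it otherwise. The floor keeps the bound from collapsing
    toward zero when a well-fitted majority dominates the batch.
    """
    frac_below = sum(n <= bound for n in grad_norms) / len(grad_norms)
    proposed = bound * math.exp(-lr * (frac_below - target_quantile))
    return max(proposed, lower_bound)
```

With the floor in place, a batch made up entirely of tiny gradient norms can no longer push the bound below `lower_bound`, so larger gradients from hard examples are suppressed less.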

2505.03296 2026-06-11 cs.RO cs.AI cs.LG (updated version)

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada

Affiliations: Department of Computer Science, University of Freiburg, Germany

AI summary: Proposes MiDiGap, which uses mixtures of discrete-time Gaussian processes to represent and imitate robot manipulation policies from a few demonstrations and camera observations, reaches state-of-the-art performance on long-horizon, highly constrained, dynamic, and multimodal tasks, and supports inference-time steering.

Comments: Submitted for publication to IEEE Transactions on Robotics

Abstract

We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at https://midigap.cs.uni-freiburg.de.

2504.12556 2026-06-11 cs.CV (updated version)

Contour Field based Elliptical Shape Prior for the Segment Anything Model

Xinyu Zhao, Faqiang Wang, Li Cui, Yuping Duan, Jun Liu

Affiliations: University of Science and Technology of China

AI summary: To address SAM's difficulty in efficiently producing elliptical segmentation results, proposes a parameterized elliptical contour field constraint that fuses the elliptical prior with image features through variational methods and a dual algorithm, improving segmentation accuracy on specific tasks.

Abstract

The elliptical shape prior information plays a vital role in improving the accuracy of image segmentation for specific tasks in medical and natural images. Existing deep learning-based segmentation methods, including the Segment Anything Model (SAM), often struggle to produce segmentation results with elliptical shapes efficiently. This paper proposes a new approach to integrate the prior of elliptical shapes into the deep learning-based SAM image segmentation techniques using variational methods. The proposed method establishes a parameterized elliptical contour field, which constrains the segmentation results to align with predefined elliptical contours. Utilizing the dual algorithm, the model seamlessly integrates image features with elliptical priors and spatial regularization priors, thereby greatly enhancing segmentation accuracy. By decomposing SAM into four mathematical sub-problems, we integrate the variational ellipse prior to design a new SAM network structure, ensuring that the segmentation output of SAM consists of elliptical regions. Experimental results on some specific image datasets demonstrate an improvement over the original SAM.

2503.22926 2026-06-11 cs.RO (updated version)

SR-LIO++: LiDAR-Inertial Odometry and Quantized Mapping with Caching-Aware Sweep Reconstruction

Zikang Yuan, Ruiye Ming, Chengwei Zhao, Yonghao Tan, Pingcheng Dong, Yuan Ren, Yuzhong Jiao, Xin Yang, Kwang-Ting Cheng

Affiliations: ACCESS-AI Chip Center for Emerging Smart Systems, InnoHK Centers, Hong Kong Science Park, Hong Kong, China; Hong Kong University of Science and Technology, Hong Kong, China; Huazhong University of Science and Technology, Wuhan, China; Hangzhou Qisheng Intelligent Technology Co. Ltd., Hangzhou, China; Southeast University, Nanjing, China

AI summary: Proposes SR-LIO++, which raises output frequency through sweep reconstruction and uses a caching mechanism together with quantized map point management to deliver accurate, efficient LiDAR-inertial odometry on resource-constrained platforms.

Comments: 18 pages, 10 figures

Abstract

Addressing the inherent low acquisition frequency limitation of 3D LiDAR to achieve high-frequency output has become a critical research focus in the LiDAR-Inertial Odometry (LIO) domain. To ensure real-time performance, frequency-enhanced LIO systems must process each sweep within a significantly reduced timeframe, which presents substantial challenges for deployment on resource-constrained platforms. To address these limitations, we introduce SR-LIO++, an innovative LIO system capable of achieving doubled output frequency relative to input frequency on resource-constrained hardware platforms, including the Raspberry Pi 4B. Our system employs the previously proposed sweep reconstruction methodology to enhance LiDAR sweep frequency, generating high-frequency reconstructed sweeps. Building upon this foundation, we propose a caching mechanism for intermediate results (i.e., surface parameters) of the most recent segments, effectively minimizing redundant processing of common segments in adjacent reconstructed sweeps. This method decouples processing time from the traditionally linear dependence on reconstructed sweep frequency. Furthermore, we present a quantized map point management scheme based on index table mapping, significantly reducing memory usage by converting global 3D point storage from 64-bit double precision to 8-bit char representation. This method also converts the computationally intensive Euclidean distance calculations in nearest neighbor searches from 64-bit double precision to 16-bit short and 32-bit integer formats, reducing computational cost. Extensive experimental evaluations across three distinct computing platforms and four public datasets demonstrate that SR-LIO++ maintains state-of-the-art accuracy while substantially enhancing efficiency. Notably, our system successfully achieves 20 Hz state output on Raspberry Pi 4B hardware.
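
To make the quantized map point idea more concrete, here is a rough sketch under stated assumptions. The abstract does not give the index table layout, so the scheme below (an integer voxel index plus one 8-bit offset per axis) and all names (`quantize_point`, `dequantize_point`, `squared_distance_int`, `voxel_size`, `levels`) are hypothetical choices for illustration, not the SR-LIO++ data structure.

```python
import math


def quantize_point(point, voxel_size=1.0, levels=256):
    """Split a 3D point into an integer voxel index and 8-bit in-voxel offsets."""
    voxel, offsets = [], []
    for c in point:
        v = math.floor(c / voxel_size)
        frac = c / voxel_size - v  # position inside the voxel, in [0, 1)
        voxel.append(v)
        offsets.append(min(levels - 1, int(frac * levels)))
    return tuple(voxel), tuple(offsets)


def dequantize_point(voxel, offsets, voxel_size=1.0, levels=256):
    """Recover an approximate point at the center of each quantization cell."""
    return tuple((v + (o + 0.5) / levels) * voxel_size
                 for v, o in zip(voxel, offsets))


def squared_distance_int(voxel_a, offsets_a, voxel_b, offsets_b, levels=256):
    """Squared distance in quantization units, using integer arithmetic only."""
    total = 0
    for va, oa, vb, ob in zip(voxel_a, offsets_a, voxel_b, offsets_b):
        d = (va - vb) * levels + (oa - ob)
        total += d * d
    return total
```

Python integers are unbounded, so the fixed-width 16-bit and 32-bit arithmetic mentioned in the abstract is only mimicked here. The sketch shows that neighbor comparisons can run on small integers while the per-axis reconstruction error stays below `voxel_size / levels`.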

2503.10973 2026-06-11 cs.LG (updated version)

Learning Patterns and Abstractions from Perceptual Sequences

Shuchen Wu

Affiliations: Faculty of Science and Faculty of Medicine, Eberhard Karls University of Tübingen

AI summary: Studies the computational principles by which chunking and abstraction discover patterns and hierarchical structure in perceptual sequences, and proposes a rational chunking model and a non-parametric hierarchical variable model for efficient sequence factorization and unsupervised pattern discovery.

Comments: Doctoral thesis

Abstract

Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

2503.08445 2026-06-11 cs.RO (updated version)

iPack: Intuitive Bin Packing with Large Language Models

Yannik Blei, Michael Krawez, Adrian Göß, Devadas Vijayan Sheela, Tobias Jülg, Pierre Krack, Florian Walter, Wolfram Burgard

Affiliations: University of Technology Nuremberg; Friedrich-Alexander University of Erlangen–Nuremberg; Technical University of Munich

AI summary: Proposes LLM-Pack, which uses language and vision foundation models to generate grocery packing sequences that mimic human strategies, handles new items without dedicated training, and has a modular design that makes upgrades easy.

Comments: 7 pages, 9 figures

Abstract

Robotics and automation are increasingly influential in logistics but remain largely confined to traditional warehouses. In grocery retail, advancements such as cashier-less supermarkets exist, yet customers still manually pick and pack groceries. While there has been a substantial focus in robotics on the bin picking problem, the task of packing objects and groceries has remained largely untouched. Yet packing grocery items in the right order is crucial for preventing product damage, e.g., heavy objects should not be placed on top of fragile ones. The exact criteria for the right packing order are hard to define, however, in particular given the huge variety of objects typically found in stores. In this paper, we introduce LLM-Pack, a novel approach for grocery packing. LLM-Pack leverages language and vision foundation models for identifying groceries and generating a packing sequence that mimics human packing strategy. LLM-Pack does not require dedicated training to handle new grocery items and its modularity allows easy upgrades of the underlying foundation models. We extensively evaluate our approach to demonstrate its performance. We will make the source code of LLM-Pack publicly available upon the publication of this manuscript.

2503.06578 2026-06-11 cs.RO cs.SY eess.SY (updated version)

Non-Equilibrium MAV-Capture-MAV via Time-Optimal Planning and Reinforcement Learning

Canlun Zheng, Zhanyu Guo, Zikang Yin, Chunyu Wang, Zhikun Wang, Shiyu Zhao

Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; WINDY Lab, Department of Artificial Intelligence, Westlake University, Hangzhou, China; Department of Electrical Engineering, California Institute of Technology, Pasadena, USA

AI summary: To tackle the capture of highly maneuverable targets, designs a compact capture MAV and combines time-optimal planning with reinforcement learning to capture targets in non-equilibrium (unstable) states.

Abstract

The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.

2501.12942 2026-06-11 cs.AI (updated version)

Offline Diffusion Policy for Multi-User Delay-Constrained Scheduling

Zhuoran Li, Ruishuo Chen, Hai Zhong, Longbo Huang

Affiliations: Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University

AI summary: Proposes SOCD, an offline reinforcement learning algorithm that uses a diffusion policy guided by a critic network to learn efficient scheduling policies from offline data without online interaction, and performs well in partially observable and large-scale environments.

Abstract

Effective multi-user delay-constrained scheduling is crucial in various real-world applications, including embodied AI, instant messaging, live streaming, and data center management, where efficient resource allocation is required among users with diverse delay sensitivities. In these scenarios, schedulers must make real-time decisions to satisfy both delay and resource constraints without prior knowledge of system dynamics, which are often time-varying and challenging to estimate. Current learning-based methods typically require online interactions with actual systems during the training stage. Therefore, these approaches are often difficult or impractical, as they can significantly degrade system performance and incur substantial service costs. To address these challenges, we propose a novel offline reinforcement learning-based algorithm, named Scheduling By Offline Learning with Critic Guidance and Diffusion Model (SOCD), to learn efficient scheduling policies purely from pre-collected offline data. SOCD innovatively employs a diffusion policy, complemented by a sampling-free critic network for policy guidance. By integrating the Lagrangian multiplier optimization into the offline reinforcement learning, SOCD efficiently trains high-quality constraint-aware policies exclusively from available datasets, eliminating the need for online interactions with the system. Experimental results demonstrate that SOCD is resilient to various system dynamics, including partially observable and large-scale environments, and delivers superior performance compared to existing methods.
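
The abstract mentions folding Lagrangian multiplier optimization into offline reinforcement learning. As a generic sketch of that pattern only (the abstract does not specify SOCD's actual update), the snippet below shows a projected dual-ascent step and the penalized objective it induces. The names `update_multiplier` and `penalized_reward`, the step size `lr`, and the idea of a scalar resource `budget` are assumptions made for illustration.

```python
def update_multiplier(lam, observed_cost, budget, lr=0.1):
    """Projected dual ascent on the multiplier.

    The multiplier rises while the observed resource cost exceeds the budget
    and falls otherwise; it is clamped at zero so it stays a valid multiplier.
    """
    return max(0.0, lam + lr * (observed_cost - budget))


def penalized_reward(reward, cost, lam):
    """Lagrangian objective a policy would maximize for a fixed multiplier."""
    return reward - lam * cost
```

Alternating between improving the policy on `penalized_reward` and calling `update_multiplier` is the usual way such constraint-aware training loops are organized.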

2410.12327 2026-06-11 cs.CL (updated version)

Neuron-based Personality Trait Induction in Large Language Models

Jia Deng, Tianyi Tang, Yanbin Yin, Wenhao Yang, Wayne Xin Zhao, Ji-Rong Wen

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Tongyi Lab; Institute of Statistics and Big Data, Renmin University of China

AI summary: Proposes a neuron-based method for inducing personality traits in large language models: it builds the PersonalityBench dataset, identifies personality-related neurons, and adjusts their values, giving training-free control over exhibited traits with results comparable to fine-tuned models.

Comments: 25 pages. Published at ICLR 2025

Journal ref: The Thirteenth International Conference on Learning Representations (ICLR 2025), 2025

Abstract

Large language models (LLMs) have become increasingly proficient at simulating various personality traits, an important capability for supporting related applications (e.g., role-playing). To further improve this capacity, in this paper, we present a neuron-based approach for personality trait induction in LLMs, with three major technical contributions. First, we construct PersonalityBench, a large-scale dataset for identifying and evaluating personality traits in LLMs. This dataset is grounded in the Big Five personality traits from psychology and is designed to assess the generative capabilities of LLMs towards specific personality traits. Second, by leveraging PersonalityBench, we propose an efficient method for identifying personality-related neurons within LLMs by examining the opposite aspects of a given trait. Third, we develop a simple yet effective induction method that manipulates the values of these identified personality-related neurons. This method enables fine-grained control over the traits exhibited by LLMs without training or modifying model parameters. Extensive experiments validate the efficacy of our neuron identification and trait induction methods. Notably, our approach achieves performance comparable to that of fine-tuned models, offering a more efficient and flexible solution for personality trait induction in LLMs. We provide access to all the mentioned resources at https://github.com/RUCAIBox/NPTI.

2409.18478 2026-06-11 cs.CV (updated version)

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks

Min Yang, Zichen Zhang, Qian Dang, Limin Wang

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; China Design Group Co., Ltd

AI summary: Proposes Temporal2Seq, a unified framework that represents the outputs of temporal video understanding tasks as sequences of discrete tokens and trains a generalist model within a single architecture; it achieves reasonable results on TAD, TAS, and GEBD and outperforms single-task training.

Comments: Accepted by CVIU

Abstract

With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there remains a dearth of a unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined as Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages compared with single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, which yields superior performance to the specific model.
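
One way to picture "formulating the output as a sequence of discrete tokens" is the toy encoding below. It is only a sketch under assumptions: the abstract does not describe Temporal2Seq's vocabulary, so the layout used here (two quantized time tokens followed by one class token per segment) and the names `segments_to_tokens`, `tokens_to_segments`, and `num_bins` are hypothetical.

```python
def segments_to_tokens(segments, duration, num_bins=100):
    """Encode (start, end, label) segments as a flat list of integer tokens.

    Times are quantized into `num_bins` bins over the video duration; class
    labels are shifted by `num_bins` so they never collide with time tokens.
    """
    tokens = []
    for start, end, label in segments:
        for t in (start, end):
            tokens.append(min(num_bins - 1, int(t / duration * num_bins)))
        tokens.append(num_bins + label)
    return tokens


def tokens_to_segments(tokens, duration, num_bins=100):
    """Decode the flat token list back into (start, end, label) segments."""
    segments = []
    for i in range(0, len(tokens), 3):
        start_bin, end_bin, class_token = tokens[i:i + 3]
        segments.append((start_bin * duration / num_bins,
                         end_bin * duration / num_bins,
                         class_token - num_bins))
    return segments
```

Because detection, segmentation, and boundary outputs can all be written as such token lists, a single sequence model could in principle be trained on all of them, which is the appeal of a shared token interface.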

2307.01472 2026-06-11 cs.AI cs.LG cs.MA (updated version)

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University; Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology

AI summary: Proposes the Diffusion Offline Multi-agent Model (DOM2), which uses a diffusion model to increase policy expressiveness and diversity and combines it with trajectory-based data reweighting, significantly improving performance, generalization, and data efficiency in offline MARL.

Abstract

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in 28 out of 30 settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data efficient and requires no more than 5% data for achieving the same performance compared to existing algorithms (a 20× improvement in data efficiency).

2305.06145 2026-06-11 cs.CV (updated version)

Causal Clothes-Invariant Feature Learning for Cloth-Changing Person Re-ID

Xulin Li, Yan Lu, Bin Liu, Jiaze Li, Yating Liu, Qi Chu, Mang Ye, Wanli Ouyang, Nenghai Yu

Affiliations: School of Cyber Science and Technology, University of Science and Technology of China; Anhui Province Key Laboratory of Digital Security; Shanghai Artificial Intelligence Laboratory; School of Data Science, University of Science and Technology of China; Wuhan University

AI summary: To address features that fail under clothing changes in cloth-changing person re-identification, proposes Causal Clothes-Invariant Learning (CCIL), which blocks the clothing shortcut through causal intervention to learn clothes-invariant features, reaching Rank-1 accuracies of 66.4% and 59.2% on PRCC and DeepChange, respectively.

Abstract

In cloth-changing person re-identification (CC-ReID), it is critical to learn clothes-invariant features, which provide discriminative ID features that remain robust against clothing changes. However, a spurious correlation currently limits existing ReID methods from effectively extracting these clothes-invariant features. This spurious correlation arises from clothing ownership: clothing is rarely shared across different identities, so models tend to memorize clothing cues for identity recognition, and this strategy generalizes poorly to unseen clothing. In this paper, we propose Causal Clothes-Invariant Learning (CCIL), which explicitly shifts CC-ReID from likelihood learning P(Y|X) to causal intervention learning P(Y|do(X)) to block the clothing shortcut. CCIL realizes this intervention through three modules: a Confounder Dictionary, an Intervention Module, and Disentangle Regularization. The causality-based modeling makes the entire model naturally clothes-invariant, effectively preventing the capture of spurious correlations in feature learning. Extensive experiments validate the effectiveness of CCIL. On PRCC and DeepChange datasets, CCIL achieves Rank-1 accuracies of 66.4% and 59.2%, outperforming state-of-the-art methods by 1.4 and 4.1 percentage points, respectively.
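
The difference between P(Y|X) and P(Y|do(X)) can be illustrated with a tiny numeric example. The sketch assumes the textbook backdoor adjustment, P(Y|do(X)) = Σ_c P(Y|X, c) P(c), with clothing treated as a discrete confounder c. Whether CCIL uses exactly this formula is not stated in the abstract, and the probabilities and function names below are invented for illustration.

```python
def observational(p_y_given_xc, p_c_given_x):
    """P(Y|X): confounder values are weighted by how often they co-occur with X."""
    return sum(p_y_given_xc[c] * p_c_given_x[c] for c in p_y_given_xc)


def interventional(p_y_given_xc, p_c):
    """P(Y|do(X)) via backdoor adjustment: confounder values use their prior P(c)."""
    return sum(p_y_given_xc[c] * p_c[c] for c in p_y_given_xc)


# Hypothetical numbers: the identity is recognized far more easily in a red coat.
p_y_given_xc = {"red_coat": 0.9, "blue_coat": 0.3}
p_c_given_x = {"red_coat": 0.8, "blue_coat": 0.2}  # this person usually wears red
p_c = {"red_coat": 0.5, "blue_coat": 0.5}          # clothing prior over the dataset
```

With these numbers the observational score is inflated by the person's habitual outfit, while the adjusted score weights every outfit by its overall prior, which is the sense in which an intervention "blocks the clothing shortcut".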

2605.26938 2026-06-11 cs.AI math.OC

Developing a Totally Unimodular Linear Program for Optimal Conformance Checking: When and Why It Complements A*

Izack Cohen

Affiliations: Bar Ilan University

AI summary: Reformulates alignment-based conformance checking as a totally unimodular linear program whose network-flow structure guarantees an integral optimal solution; experiments show substantial speedups over A* on long traces with deviations.

Comments: Author-accepted manuscript accepted for publication in Expert Systems with Applications. Code and experiment scripts are available at: https://github.com/Izack-Cohen/unimodular-conformance-checking. Version corresponding to the accepted paper: v1.0.0

Journal ref: Expert Systems with Applications, Volume 331, Part A, 2026, 133021

Abstract

Alignment-based conformance checking is the state-of-the-art approach for comparing observed process executions with normative process models. The standard exact solution relies on an A*-based heuristic search, which can exhibit exponential runtime in the presence of long traces or substantial deviations. This paper introduces a reformulation of alignment-based conformance checking as a totally unimodular linear program (LP) defined on the reachability graph of the synchronous product. By exploiting the underlying network-flow structure, the proposed formulation guarantees the existence of an integral optimal extreme-point solution through LP relaxation, thereby avoiding the combinatorial overhead associated with integer variables and branch-and-bound search. We conduct an extensive empirical evaluation on more than 2.1 million conformance checking instances derived from real-world and synthetic benchmark datasets. The results show that A* and the LP approach exhibit complementary performance characteristics: the former performs best on short, well-conforming traces, while the LP formulation provides substantial speedups for longer traces with deviations, precisely where conformance checking is most informative. Based on these findings, we derive simple algorithm-selection guidelines that combine both approaches, achieving average runtime savings of 38.6% with 96% selection accuracy compared to always using A*.
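
The integrality guarantee mentioned in the abstract rests on a classical fact: the node-arc incidence matrix of a directed graph is totally unimodular, meaning every square submatrix has determinant -1, 0, or 1. The sketch below brute-force checks that property on a small hypothetical graph. It is not the paper's formulation, and the helper names and example arcs are made up for illustration.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion (fine for tiny matrices)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        if a == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def is_totally_unimodular(matrix):
    """Check every square submatrix; exponential cost, so small inputs only."""
    rows, cols = len(matrix), len(matrix[0])
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[matrix[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


def incidence_matrix(num_nodes, arcs):
    """Node-arc incidence matrix: +1 where an arc leaves a node, -1 where it enters."""
    m = [[0] * len(arcs) for _ in range(num_nodes)]
    for j, (u, v) in enumerate(arcs):
        m[u][j] = 1
        m[v][j] = -1
    return m


# Hypothetical 4-node graph standing in for a tiny reachability graph.
arcs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
flow_matrix = incidence_matrix(4, arcs)
```

When the constraint matrix has this property and the right-hand side is integral, the extreme points of the LP feasible region are integral, so a shortest-path-style LP can be solved without integer variables or branch-and-bound.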

2605.29355 2026-06-11 cs.LG q-bio.NC

Neural-Behavioral Representation of Natural Whole-body Movement in Monkeys

猴子自然全身运动的神经-行为表征

Jieshi He, Puzhe Li, Yanan Sui, Mu-ming Poo

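Affiliations: Center for Excellence in Brain Science and Intelligence Technology, CAS; Tsinghua University

AI summary: Combines large-scale cortical signals with multi-view motion capture and an autoregressive encoder-decoder model to accurately decode the whole-body movements of freely moving monkeys.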

详情
AI中文摘要

理解皮层活动如何表征灵长类动物的自然全身行为仍然具有挑战性。受限于运动的多样性和全身运动学大规模神经表征的不可及性,先前的运动解码研究集中于受限任务和有限的肢体运动。在这里,我们提出了一个用于自由运动猴子的神经-行为记录和建模框架,通过定制的数据采集平台,将来自分布式感觉和运动相关区域的大规模硬膜外皮层信号与同步的多视角运动捕捉相结合。我们重建了猴子的全身运动学,并使用自回归编码器-解码器模型学习了紧凑的行为先验。以神经信号为条件,该模型在没有明确物理约束的情况下解码出准确且逼真的全身运动。我们的结果为利用大规模颅内神经活动解码灵长类动物的自然全身运动提供了一种新颖的概念验证方法。

英文摘要

Understanding how cortical activity represents natural whole-body behaviors in primates remains challenging. Limited by the diversity of movements and inaccessibility of large-scale neural representation of whole-body kinematics, previous motor decoding studies focused on constrained tasks and limited limb movements. Here, we present a neural-behavioral recording and modeling framework for freely moving monkeys, combining large-scale epidural cortical signals from distributed sensory- and motor-related areas with synchronized multi-view motion capture through a custom-made data collection platform. We reconstructed whole-body monkey kinematics and learned a compact behavior prior using an autoregressive encoder-decoder model. Conditioned on neural signals, the model decoded accurate and realistic whole-body movement without explicit physical constraints. Our results provide a novel proof-of-concept approach for decoding natural whole-body movements in primates using large-scale intracranial neural activity.

2605.29292 2026-06-11 cs.CV

Turbulence-Robust Dynamic Object Segmentation with Multi-Signal Priors and SAM2 Refinement

基于多信号先验和SAM2优化的湍流鲁棒动态目标分割

Bolian Peng, Ying Tang, Xu Liu, Long Sun, Xiaoqiang Lu

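Affiliations: Xidian University

AI summary: Proposes a training-free multi-signal segmentation pipeline that combines RAFT motion estimation, DINOv2 semantic priors, ViBe background modeling, and SAM2 mask refinement for dynamic object segmentation under atmospheric turbulence.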

详情
Journal ref
Proceedings of the CVPR 2026 Workshops, UG2+ Challenge, 2026
AI中文摘要

本技术报告介绍了我们针对CVPR 2026 UG2+挑战赛第三赛道:湍流中动态目标分割(DOST)的解决方案。我们设计了一种无需训练的多信号分割流水线,结合了预训练的运动估计、自监督语义先验、背景异常建模、手动校准的提议融合以及基于SAM2的掩码优化。该方法使用RAFT获取密集运动响应,DINOv2获取语义目标先验,ViBe进行无需训练的背景建模,以及预训练的SAM2进行框提示掩码优化。 我们的系统完全在推理模式下运行,而不是优化端到端的分割网络。这种设计适用于DOST场景,其中严重的大气湍流会产生伪运动、模糊和间歇性目标可见性,使得单一运动线索不可靠。最终提交的掩码由官方排行榜评估,报告了0.425041 mIoU和0.457206 mDice。由于没有进行特定任务的模型训练或微调,更强的学习时间关联、自适应提议选择或任务特定适应可能进一步改进系统。

英文摘要

This technical report presents our solution for the CVPR 2026 UG2+ Challenge Track 3: Dynamic Object Segmentation in Turbulence (DOST). We design a training-free multi-signal segmentation pipeline that combines pretrained motion estimation, self-supervised semantic priors, background anomaly modeling, manually calibrated proposal fusion, and SAM2-based mask refinement. The method uses RAFT for dense motion responses, DINOv2 for semantic objectness priors, ViBe for training-free background modeling, and pretrained SAM2 for box-prompt mask refinement. Instead of optimizing an end-to-end segmentation network, our system operates entirely in inference mode. This design is suitable for the DOST setting, where severe atmospheric turbulence produces pseudo-motion, blur, and intermittent target visibility, making a single motion cue unreliable. The final submitted masks are evaluated by the official leaderboard, which reports 0.425041 mIoU and 0.457206 mDice. Since no task-specific model training or fine-tuning is performed, stronger learned temporal association, adaptive proposal selection, or task-specific adaptation may further improve the system.
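For readers unfamiliar with the two reported metrics, the sketch below computes IoU and Dice for a single pair of binary masks. It is a minimal illustration only: the function name and the convention of returning 1.0 for two empty masks are assumptions, and the leaderboard's exact averaging protocol (how per-frame or per-video scores become mIoU and mDice) is not stated in the abstract.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two equally sized binary masks given as 2D lists of 0/1."""
    inter = union = pred_sum = target_sum = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += p & t        # pixel set in both masks
            union += p | t        # pixel set in at least one mask
            pred_sum += p
            target_sum += t
    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    denom = pred_sum + target_sum
    dice = 2 * inter / denom if denom else 1.0
    return iou, dice
```

Averaging these values over all evaluated masks would give mean scores of the kind reported above.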

2605.17773 2026-06-11 cs.CV

PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation

PlantPose: 通过树约束图生成实现通用植物骨架估计

Xinpeng Liu, Hiroaki Santo, Yosuke Toda, Fumio Okura

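Affiliations: Graduate School of Information Science and Technology, Osaka University; Phytometrics; Institute of Transformative Bio-Molecules, Nagoya University

AI summary: Proposes PlantPose, a universal plant skeleton estimator based on tree-constrained graph generation. It combines learning-based graph generation with traditional graph algorithms to improve generalization and achieves robust, accurate skeleton estimation across multiple domains.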

Comments International Journal of Computer Vision, 2026

详情
AI中文摘要

准确地从图像中估计植物骨架结构(例如分支结构)对于智能农业和植物科学至关重要。与人类骨骼固定拓扑结构不同,植物骨架估计面临独特的挑战,即从图像中估计任意树状图。为了解决这个问题,我们介绍了PlantPose,一种通过树约束图生成实现的通用植物骨架估计器。PlantPose结合了基于学习的图生成与传统图算法,在训练循环中强制执行树约束。为了提高模型的泛化能力,我们精心编排了一个包含真实世界和合成植物图像以及简化表示(例如草图和抽象画)的大型多样化数据集。该数据集使通用模型能够适应各种输入样式和植物图像类别,同时保持拓扑一致性。我们的方法在多个领域实现了鲁棒且准确的植物骨架估计,包括之前未见过的域外场景。进一步的分析突显了该方法在处理复杂、异质数据分布方面的优势和局限性。所有实现和数据集均在https://github.com/huntorochi/PlantPose/上提供。

英文摘要

Accurate estimation of plant skeletal structures (e.g., branching structures) from images is essential for smart agriculture and plant science. Unlike human skeletons with fixed topology, plant skeleton estimation presents a unique challenge, i.e., estimating arbitrary tree graphs from images. To address this problem, we introduce PlantPose, a universal plant skeleton estimator via tree-constrained graph generation. PlantPose combines learning-based graph generation with traditional graph algorithms to enforce tree constraints during the training loop. To enhance the model's generalization capability, we curate a large and diverse dataset comprising real-world and synthetic plant images, along with simplified representations (e.g., sketches and abstract drawings). This dataset enables the generalized model to adapt to diverse input styles and categories of plant images while preserving topological consistency. Our approach demonstrates robust and accurate plant skeleton estimation across multiple domains, including previously unseen out-of-domain scenarios. Further analyses highlight the method's strengths and limitations in handling complex, heterogeneous data distributions. All implementations and datasets are available at https://github.com/huntorochi/PlantPose/.
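One way a traditional graph algorithm can enforce a tree constraint on learned edge predictions is to keep only a maximum spanning tree of the scored edges. The abstract does not say which algorithm PlantPose uses, so the Kruskal-style sketch below is an assumption for illustration, as are the function name and the `(score, u, v)` edge format.

```python
def max_spanning_tree(num_nodes, scored_edges):
    """Keep the highest-scoring acyclic edge set (Kruskal with union-find).

    scored_edges: iterable of (score, u, v) tuples; hypothetical format
    standing in for a model's predicted edge confidences.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for score, u, v in sorted(scored_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # adding this edge cannot create a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree
```

With four nodes and candidate edges scored 0.9 (0-1), 0.8 (1-2), 0.7 (0-2), and 0.2 (2-3), the 0-2 edge is dropped because it would close a cycle.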

2604.01383 2026-06-11 cs.CV cs.AI

GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization

GRAZE:基于视觉定位的细化与运动感知的零样本事件定位

Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi

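Affiliations: Kansas State University; Albright College

AI summary: Proposes GRAZE, a zero-shot event localization method that requires no labeled data and combines Grounding DINO with SAM2 for motion-aware contact localization that stays robust in cluttered scenes.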

Comments 9 pages, 5 figures, accepted to the CVPR 2026 Workshop on Computer Vision in Sports (CVSports) code: https://github.com/AhsanZaidi12/GRAZE

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10087-10095, June 2026
AI中文摘要

美式橄榄球训练会产生大量视频,但感兴趣的交互只占每段未剪辑长视频中的短暂窗口。因此,可靠的生物力学分析依赖于时空定位,既要识别交互实体,也要识别接触的起始时刻。我们研究首次接触点(First Point of Contact,FPOC),其定义为球员首次实际触碰擒抱假人的那一帧;研究对象是无约束的训练视频,其中存在相机运动、杂乱背景、多名装备相似的运动员以及撞击前后的快速姿态变化。我们提出GRAZE,一种无需训练的FPOC定位流水线,不需要任何带标注的擒抱接触样本。GRAZE使用Grounding DINO发现候选的球员-假人交互,通过运动感知的时间推理加以细化,并使用SAM2作为显式的像素级接触验证器,而不是仅依赖检测置信度。这种将候选发现与接触确认分离的做法,使该方法对杂乱场景和撞击附近不稳定的视觉定位更具鲁棒性。在738个擒抱练习视频上,GRAZE对97.4%的片段产生有效输出,在77.5%的片段中将FPOC定位在±10帧以内,在82.7%的片段中定位在±20帧以内。这些结果表明,无需任务特定训练即可在真实训练视频中实现帧级精度的接触起始定位。

英文摘要

American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within $\pm$ 10 frames on 77.5% of all clips and within $\pm$ 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
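The tolerance-based accuracy figures can be read as follows: a clip counts as a hit when the predicted contact frame lies within the tolerance of the annotated frame, and the rate is taken over all clips. The sketch below is a minimal illustration with assumed names. Treating clips without a valid output (`None`) as misses is an interpretation of the phrase "of all clips", not a detail confirmed by the abstract.

```python
def within_tolerance_rate(predicted, truth, tol):
    """Fraction of all clips whose predicted frame is within +/- tol of the truth.

    predicted: list of frame indices, or None when no valid output was produced.
    truth: list of annotated first-contact frame indices, same length.
    """
    hits = sum(
        1
        for p, t in zip(predicted, truth)
        if p is not None and abs(p - t) <= tol
    )
    return hits / len(truth)  # denominator is all clips, not only valid outputs
```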

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

CostNav:一个用于现实世界经济成本评估的物理AI代理导航基准

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

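Affiliations: KAIST; University of California, Irvine; Seoul National University

AI summary: CostNav introduces an economic navigation benchmark that pairs physics simulation with industry data to assess the economic viability of AI agents. It finds that high task success does not guarantee economic viability; among methods with non-zero SLA compliance, CANVAS performs best.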

详情
AI中文摘要

当前导航基准侧重于任务成功率,但未捕捉到商业化自主配送系统所需的关键经济约束。我们引入了CostNav,一个经济导航基准,通过Isaac Sim的碰撞和货物动力学与行业标准数据如证券交易委员会(SEC)文件和简化伤害分级(AIS)伤害报告相结合,评估物理AI代理的成本收益和盈亏分析。我们发现,高任务成功率并不保证经济可行性。评估七种基线方法(两种基于规则和五种模仿学习方法)后,发现无方法经济可行:所有方法均产生负贡献边际。CANVAS仅使用RGB相机和GPS,在非零服务等级协议(SLA)合规性下获得最高任务成功率和最不负面的边际(-28.40/次),优于配备LiDAR的Nav2 w/ GPS(-37.34/次)。一个在模拟中训练的策略在真实配送机器人上评估时,SLA合规性接近其模拟结果,表明CostNav模拟中的策略性能可以转移到现实部署中。我们挑战社区在CostNav上实现经济可行性,该基准通过成本收益结果评分所有方法。所有资源均在https://github.com/worv-ai/CostNav上提供。

英文摘要

Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-\$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-\$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.
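The two economic quantities named in the abstract can be sketched with their standard accounting definitions. This is not CostNav's cost model, which is not given here. The function names are hypothetical, and any dollar values passed in are made up for illustration.

```python
import math


def contribution_margin(revenue_per_run, variable_cost_per_run):
    """Per-run contribution margin: revenue minus variable cost."""
    return revenue_per_run - variable_cost_per_run


def break_even_runs(fixed_cost, margin_per_run):
    """Runs needed to recover fixed costs, or None if the margin is not positive."""
    if margin_per_run <= 0:
        return None  # each run loses money, so break-even is never reached
    return math.ceil(fixed_cost / margin_per_run)
```

A negative margin, as reported for every baseline, means `break_even_runs` returns `None` regardless of the fixed cost. That is the sense in which no method is economically viable.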

2511.08113 2026-06-11 cs.CL

Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

多模态大语言模型在跨模态间无法最优组合技能

Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

发表机构 * arXiv

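AI summary: Studies the ability of multimodal LLMs to compose skills across modalities, finds a significant composition gap, and explores chain-of-thought prompting and fine-tuning strategies, which improve results but do not close the gap.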

详情
AI中文摘要

技能组合是指将已学习的技能组合以解决新任务。随着神经网络在预训练中获得越来越复杂的技能,不清楚它们能多好地组合这些技能。本文聚焦多模态大语言模型(MLLM),研究其跨模态技能组合能力。为此,我们设计了三个评估任务,可按顺序组合两个模态依赖的技能,并在两种主要设置下评估多个开源MLLM:i)直接提示模型解决问题,和ii)使用两步级联推理方法,手动强制组合两个技能。即使在这些简单的组合中,我们发现所有评估的MLLM都表现出显著的跨模态技能组合差距。为缓解上述差距,我们探索了两种替代方案:i)使用链式推理提示明确指导MLLM进行技能组合,和ii)一种特定的微调配方以促进技能组合。尽管这些策略提高了模型性能,但它们仍然表现出显著的技能组合差距,表明需要更多研究来改进MLLM中的跨模态技能组合。

英文摘要

Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLM), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore two alternatives: i) use chain-of-thought prompting to explicitly instruct MLLMs for skill composition and ii) a specific fine-tuning recipe to promote skill composition. Although those strategies improve model performance, they still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.

2506.22141 2026-06-11 cs.CL cs.IR

DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

DAPFAM: 一个领域感知的家族级数据集用于跨领域专利检索基准测试

Iliass Ayaou, Denis Cavallucci, Hicham Chibane

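Affiliations: Institut National des Sciences de l'Univers, Strasbourg

AI summary: DAPFAM introduces a new IPC3-overlap scheme to build a family-level benchmark of 1,247 query families and 45,336 target families. Its 249 experiments reveal a pronounced domain gap in cross-domain retrieval and characterize the effect of passage-level retrieval and dense methods.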

详情
AI中文摘要

专利先例检索在相关披露跨技术边界时尤其具有挑战性。现有基准缺乏显式的领域划分,难以评估检索系统如何应对这种转变。我们引入DAPFAM,一个具有显式IN域和OUT域划分的家族级基准,通过新的IPC3重叠方案定义。数据集包含1,247个查询家族和45,336个目标家族,以家族级聚合减少国际冗余,基于引用进行相关性判断。我们进行了249次受控实验,涵盖词汇(BM25)和密集(transformer)后端、文档和段落级检索、多种查询和文档表示、聚合策略以及通过反向排名融合(RRF)的混合融合。结果揭示了显著的领域差距:OUT域性能在所有配置中始终比IN域低约五倍。段落级检索始终优于文档级,密集方法在BM25上提供小幅提升,但无法缩小OUT域差距。文档级RRF在效果效率权衡上表现强劲,且具有最小的开销。通过揭示跨领域检索的持续挑战,DAPFAM提供了一个可重复、计算意识的测试床,用于开发更稳健的专利信息检索系统。数据集可在huggingface上公开获取:https://huggingface.co/datasets/datalyes/DAPFAM_patent。

英文摘要

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document and passage level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level, and dense methods provide modest gains over BM25, but none close the OUT-domain gap. Document-level RRF yields strong effectiveness efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on huggingface at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
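Reciprocal Rank Fusion combines several ranked lists by summing, for each document, the reciprocal of a constant plus its rank in each list. The sketch below is a minimal illustration. The constant `k=60` is a commonly used default and an assumption here, since the abstract does not state the value used, and the function name and alphabetical tie-breaking are illustrative choices.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; a higher fused score ranks first.

    rankings: list of lists, each ordered from most to least relevant.
    k: smoothing constant (assumed default, not taken from the paper).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a lexical ranking `["A", "B", "C"]` with a dense ranking `["B", "C", "A"]` puts `"B"` first, because it is ranked highly by both backends.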

2510.09885 2026-06-11 cs.CL cs.AI

Diffusion-Inspired Masked Fine-Tuning for Knowledge Injection in Autoregressive LLMs

受扩散启发的掩码微调用于自回归大语言模型中的知识注入

Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky

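Affiliations: Harvard University; University of Texas Health Science Center at Houston; Hebrew University

AI summary: Proposes a masked fine-tuning method in which an autoregressive LLM reconstructs the original text, improving knowledge injection without paraphrase augmentation and overcoming the reversal curse. Experiments show strong results on knowledge-intensive tasks.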

详情
AI中文摘要

大型语言模型(LLMs)常用于事实不断变化的环境,但通过在非结构化文本上微调来更新事实知识时,常常面临1)依赖计算密集的改写增强和2)反向诅咒的问题。最近的研究表明,扩散大语言模型(dLLMs)在预训练中需要更少的训练样本以达到更低的损失,并且对反向诅咒更具抵抗力,表明dLLMs可能比自回归大语言模型(arLLMs)更容易学习新知识。我们通过受控的知识微调实验检验这一假设,发现虽然arLLMs依赖改写增强将知识文本泛化为问答(QA)能力,但dLLMs无需改写即可实现高QA准确性。为进一步研究是否仅凭去掩码目标就能在dLLMs中诱导这种知识注入优势,无论其扩散去噪范式如何,我们提出了arLLMs的掩码微调方法,该方法促使arLLM在上下文中的掩码版本中重建原始文本。arLLMs的掩码微调显著提高了知识注入的有效性,即无需改写且对反向诅咒具有抵抗力,缩小了arLLMs与dLLMs之间的差距。我们还展示了更广泛的应用:在大规模知识密集型数据集(120万个样本)上,掩码SFT在GPQA-diamond中实现了所有微调变体中最佳的下游准确性。去掩码目标也提高了数学任务上的SFT,表明其在事实知识注入之外的广泛用途。

英文摘要

Large language models (LLMs) are often used in environments where facts evolve, yet factual knowledge updates via fine-tuning on unstructured text often suffer from 1) reliance on compute-heavy paraphrasing augmentation and 2) the reversal curse. Recent studies show diffusion large language models (dLLMs) require fewer training samples to achieve lower loss in pre-training and are more resistant to the reversal curse, suggesting dLLMs may learn new knowledge more easily than autoregressive LLMs (arLLMs). We test this hypothesis in controlled knowledge fine-tuning experiments and find that while arLLMs rely on paraphrase augmentation to generalize knowledge text into question-answering (QA) capability, dLLMs do not require paraphrases to achieve high QA accuracy. To further investigate whether the demasking objective alone can induce such a knowledge injection advantage in dLLMs regardless of their diffusion denoising paradigm, we propose masked fine-tuning for arLLMs, which prompts an arLLM to reconstruct the original text given a masked version in context. The masked fine-tuning for arLLMs substantially improves the efficacy of knowledge injection, i.e. no paraphrase needed and resistant to the reversal curse, closing the gap between arLLMs and dLLMs. We also demonstrate broader applicability: on a large-scale knowledge-intensive dataset (1.2M samples), masked SFT achieves the best downstream accuracy on GPQA-diamond among all fine-tuning variants. The demasking objective also improves SFT on math tasks, suggesting broad utility beyond factual knowledge injection.
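The masked fine-tuning setup, in which the model sees a masked copy of a passage in context and is trained to reproduce the original, can be sketched as a data-construction step. Everything below is an assumption for illustration: the word-level masking, the 30% mask ratio, the `[MASK]` token, the prompt wording, and the function name are not taken from the paper.

```python
import random


def make_masked_example(text, mask_ratio=0.3, mask_token="[MASK]", seed=0):
    """Build one (prompt, target) pair for reconstruction-style fine-tuning.

    The prompt carries a masked copy of the text; the target is the original
    text, on which a training loss would typically be computed.
    """
    rng = random.Random(seed)                      # local RNG for reproducibility
    words = text.split()
    n_mask = max(1, round(len(words) * mask_ratio))
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    masked_words = [
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    ]
    prompt = (
        "Reconstruct the original text.\n"
        "Masked: " + " ".join(masked_words) + "\n"
        "Original:"
    )
    return {"prompt": prompt, "target": " " + text}
```

Drawing a fresh mask (a different `seed`) each time the same passage is visited would expose the model to many views of one fact, which is the intuition the abstract gives for not needing paraphrases.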

2602.03147 2026-06-11 cs.RO

Multi-function Robotized Surgical Dissector for Endoscopic Pulmonary Thromboendarterectomy: Preclinical Study and Evaluation

用于内窥镜下肺动脉血栓内膜剥脱术的多功能机器人化手术分离器:临床前研究与评估

Runfeng Zhu, Xin Zhong, Qingxiang Zhao, Jing Lin, Zhong Wu, Kang Li

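Affiliations: West China Hospital of Medicine, Sichuan University; School of Mechanical and Electrical Engineering, University of Electronic Science and Technology of China

AI summary: Proposes a robotized endoscopic surgical dissector based on a concentric push/pull robot structure to improve dexterity and precision in endoscopic pulmonary thromboendarterectomy, and validates its physical performance and surgical-simulation results through experiments.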

详情
AI中文摘要

慢性严重肺动脉血栓栓塞患者需通过肺动脉血栓内膜剥脱术(PTE)去除肺动脉内的血栓和内膜。在手术过程中,外科医生需使用镊子和分离器精细剥离阻塞物,但现有工具刚性且直,缺乏远端灵活性,难以进入细小的肺动脉分支。因此,本文提出一种基于同心推拉机器人(CPPR)结构的新型机器人化分离器,能够进入曲折肺动脉的深部细小分支。与传统刚性分离器相比,本设计具有细长性和双段弯曲灵活性。由于基于CPPR的分离器具有空心且薄壁的结构,其直径3.5mm的细长器身的中央腔道可容纳用于灌注和尖端工具的两个通道,并为内窥镜摄像头信号线留出空间。为了提供精确的手术操作,建立了基于优化的运动学模型,在开环控制策略下,60mm长度的尖端工具定位精度达到2mm。因此,借助内窥镜摄像头,传统PTE有望升级为内窥镜PTE。通过实验评估了机器人化分离器的基本物理性能,包括刚度、运动精度和可操作性。离体猪肺手术模拟也展示了其灵活性和在PTE中的显著优势。

英文摘要

Patients suffering from chronic severe pulmonary thromboembolism need pulmonary thromboendarterectomy (PTE) to remove the thrombus and intima located inside the pulmonary artery (PA). During the surgery, a surgeon holds tweezers and a dissector to delicately strip the blockage, but the available tools are rigid and straight and lack the distal dexterity to access thin branches of the PA. Therefore, this work presents a novel robotized dissector based on a concentric push/pull robot (CPPR) structure that can enter deep, thin branches of the tortuous PA. Compared with conventional rigid dissectors, our design is slender and offers dual-segment bending dexterity. Owing to the hollow, thin-walled structure of the CPPR-based dissector, its slender body of 3.5 mm in diameter has a central lumen that accommodates two channels for irrigation and the tip tool, plus space for the endoscopic camera's signal wire. To provide accurate surgical manipulation, an optimization-based kinematics model was established, achieving 2 mm accuracy in positioning the tip tool (60 mm length) under an open-loop control strategy. With the endoscopic camera, traditional PTE can thus potentially be upgraded to endoscopic PTE. The basic physical performance of the robotized dissector, including stiffness, motion accuracy, and maneuverability, was evaluated through experiments. Surgery simulation on an ex vivo porcine lung also demonstrates its dexterity and notable advantages in PTE.

2601.21824 2026-06-11 cs.LG cs.DC

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

DASH:确定性注意力调度用于高吞吐量可重复LLM训练

Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

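Affiliations: School of Computer Science, Shanghai Jiao Tong University; ByteDance Seed; Zhiyuan College, Shanghai Jiao Tong University

AI summary: Proposes DASH, which optimizes the scheduling of compute and gradient-reduction phases to reduce the performance loss of deterministic attention training and improve LLM training efficiency.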

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
AI中文摘要

确定性对于大语言模型(LLM)训练的可重复性至关重要,但往往带来显著的性能损失。在广泛使用的注意力实现中,如FlashAttention-3,确定性反向传递的吞吐量可能比非确定性版本减少37.9%,主要原因是梯度累积操作必须串行化以保证数值一致性。为解决这一挑战,我们将确定性注意力的反向传递视为有向无环图(DAG)上的调度问题,并推导出最小化关键路径长度的调度方案。基于此,我们提出了DASH(确定性注意力调度用于高吞吐量),包含两种互补的调度策略:(i)递减Q-块迭代,一种反向查询块遍历,减少因果注意力中的流水线停滞;(ii)移位调度,一种在我们的DAG模型中理论上最优的调度方案,减少全掩码和因果掩码的流水线停滞。我们的实验证明,DASH缩小了确定性注意力的性能差距。与基线相比,所提策略将注意力反向传递的吞吐量提高了1.28倍,显著提升了可重复LLM训练的效率。我们的代码在https://github.com/SJTU-Liquid/deterministic-FA3上开源。

英文摘要

Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
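The reason gradient accumulation must follow a fixed order to be reproducible is that floating-point addition is not associative: summing the same partial results in a different order can change the last bits of the result. The sketch below shows this on CPU doubles. It illustrates the numerical issue only and is not DASH's scheduling algorithm. The function name and the three sample values are chosen for illustration.

```python
def accumulate(partials, order):
    """Sum partial results in the given order, as a serialized reduction would."""
    total = 0.0
    for index in order:
        total += partials[index]
    return total


partials = [0.1, 0.2, 0.3]                        # stand-ins for per-tile gradient contributions
forward_sum = accumulate(partials, [0, 1, 2])     # (0.1 + 0.2) + 0.3
reversed_sum = accumulate(partials, [2, 1, 0])    # (0.3 + 0.2) + 0.1

# Same inputs, different order, different bits.
orders_disagree = forward_sum != reversed_sum
# A fixed order always reproduces the same bits.
fixed_order_repeats = accumulate(partials, [0, 1, 2]) == forward_sum
```

When many thread blocks add into the same gradient tile in whatever order they happen to finish, the order varies between runs. Fixing the order restores bitwise reproducibility at the cost of serialization, which is the overhead the scheduling strategies aim to reduce.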

2409.00743 2026-06-11 cs.LG cs.AI

Interpretable Clustering: A Survey

可解释聚类:综述

Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He

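Affiliations: College of Information Science and Engineering, Henan University of Technology; School of Software, Dalian University of Technology; Xinchang Power Supply Company, State Grid Corporation of China

AI summary: Surveys the state of interpretable clustering algorithms, discusses why transparent clustering results matter, helps researchers choose suitable methods, and promotes the development of clustering algorithms that are both efficient and transparent.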

Comments 14 pages, 2 figures, 3 tables

详情
Journal ref
ACM Computing Surveys, Volume 58, Issue 8, Article 215 (2026)
AI中文摘要

近年来,聚类算法的研究主要集中在提高准确性和效率,但往往牺牲了可解释性。随着这些方法在医疗、金融和自动驾驶等高风险领域应用增加,透明和可解释的聚类结果变得至关重要。本文全面回顾了可解释聚类算法,识别了区分不同方法的关键标准,并提供了一个开放仓库,整理了代表性及新兴的可解释聚类方法,网址为https://github.com/hulianyu/Awesome-Interpretable-Clustering

英文摘要

In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering

2601.09072 2026-06-11 cs.AI cs.CL stat.ME

Human-AI Co-design for Clinical Prediction Models

临床预测模型的人机协同设计

Jean Feng, Avni Kothari, Patrick Vossler, Andrew Bishara, Lucas Zier, Newton Addo, Aaron Kornblith, Yan Shuo Tan, Chandan Singh

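Affiliations: University of California, San Francisco; National University of Singapore; Microsoft Research

AI summary: Proposes HACHI, a framework that uses human-AI collaboration to accelerate the development of interpretable clinical prediction models, improving model generalizability and surfacing new clinical concepts.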

详情
Journal ref
npj Digital Medicine 2026
AI中文摘要

开发安全、有效且实用的临床预测模型(CPMs)传统上需要临床专家、数据科学家和信息学家的迭代合作。此过程精炼模型构建的细微之处,如选择哪些特征/患者以及如何定义临床类别。然而,这种传统协作过程极为耗时且资源密集,导致只有少量CPMs达到临床应用。当团队试图整合非结构化临床笔记时,这一挑战尤为严峻。为解决此问题,我们引入HACHI,一种迭代的人在回路框架,利用AI代理加速开发完全可解释的CPMs。HACHI交替进行(i)AI代理快速探索和评估临床笔记中的候选概念,以及(ii)临床和领域专家提供反馈以改进CPM学习过程。HACHI将概念定义为简单的yes-no问题,用于线性模型,使临床AI团队能够透明地审查、完善和验证每一轮学习的CPM。在两个真实世界预测任务(急性肾损伤和创伤性脑损伤)中,HACHI优于现有方法,揭示了未包含在常用CPMs中的新临床相关概念,并提升了模型在不同临床站点和时间期的泛化能力。此外,HACHI揭示了临床AI团队的关键作用,如指导AI代理探索其未曾考虑的概念,调整其考虑的概念粒度,更改目标函数以更好地与临床目标一致,并识别数据偏差和泄漏问题。

英文摘要

Developing safe, effective, and practically useful clinical prediction models (CPMs) traditionally requires iterative collaboration between clinical experts, data scientists, and informaticists. This process refines the often small but critical details of the model building process, such as which features/patients to include and how clinical categories should be defined. However, this traditional collaboration process is extremely time- and resource-intensive, resulting in only a small fraction of CPMs reaching clinical practice. This challenge intensifies when teams attempt to incorporate unstructured clinical notes, which can contain an enormous number of concepts. To address this challenge, we introduce HACHI, an iterative human-in-the-loop framework that uses AI agents to accelerate the development of fully interpretable CPMs by enabling the exploration of concepts in clinical notes. HACHI alternates between (i) an AI agent rapidly exploring and evaluating candidate concepts in clinical notes and (ii) clinical and domain experts providing feedback to improve the CPM learning process. HACHI defines concepts as simple yes-no questions that are used in linear models, allowing the clinical AI team to transparently review, refine, and validate the CPM learned in each round. In two real-world prediction tasks (acute kidney injury and traumatic brain injury), HACHI outperforms existing approaches, surfaces new clinically relevant concepts not included in commonly-used CPMs, and improves model generalizability across clinical sites and time periods. Furthermore, HACHI reveals the critical role of the clinical AI team, such as directing the AI agent to explore concepts that it had not previously considered, adjusting the granularity of concepts it considers, changing the objective function to better align with the clinical objectives, and identifying issues of data bias and leakage.
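A model built from yes-no concepts stays easy to audit because each concept contributes a single signed weight. The sketch below is illustrative only: the abstract says concepts feed linear models but does not specify the link function, so the logistic form is an assumption, and the concept questions, weights, and function name are hypothetical rather than taken from HACHI.

```python
import math


def predict_risk(answers, weights, bias):
    """Logistic risk score from yes/no concept answers.

    answers: dict mapping concept question -> bool (missing means "no").
    weights: dict mapping concept question -> float coefficient.
    """
    z = bias + sum(
        weight * (1.0 if answers.get(question, False) else 0.0)
        for question, weight in weights.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical concepts and coefficients, for illustration only.
example_weights = {
    "Does the note mention reduced urine output?": 1.0,
    "Does the note mention normal kidney function?": -1.0,
}
```

Reviewers can inspect `example_weights` directly: each row is a question a clinician can read, paired with the direction and size of its effect on the score.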

2512.20011 2026-06-11 cs.CV

PaveSync: A Unified and Comprehensive Dataset for Pavement Distress Analysis and Classification

PaveSync:一种用于路面病害分析和分类的统一且全面的数据集

Blessing Agyei Kyem, Joshua Kofi Asamoah, Anthony Dontoh, Andrews Danyo, Eugene Denteh, Armstrong Aboah

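Affiliations: University of Ghana

AI summary: Introduces the PaveSync dataset, which consolidates multiple sources and covers 13 pavement distress types as a standardized resource for unified training and evaluation, and validates it by benchmarking state-of-the-art detection models.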

详情
Journal ref
2025 IEEE International Conference on Future Machine Learning and Data Science (FMLDS)
AI中文摘要

自动化路面缺陷检测常因缺乏标准化数据集而难以在多样现实条件下泛化。现有数据集在标注风格、病害类型定义和格式上存在差异,限制了统一训练。为解决此问题,我们引入了一个综合基准数据集,整合多个公开来源,形成包含7个国家52747张图像的标准集合,包含135277个边界框注释,覆盖13种不同的病害类型。该数据集捕捉了图像质量、分辨率、视角和天气条件的广泛现实变化,为一致的训练和评估提供了独特资源。其有效性通过与最先进的目标检测模型(如YOLOv8-YOLOv12、Faster R-CNN和DETR)的基准测试得到验证,这些模型在多样场景中实现了竞争性性能。通过标准化类别定义和标注格式,该数据集为路面缺陷检测提供了首个全球代表性基准,使模型比较更加公平,包括零样本迁移至新环境的能力。

英文摘要

Automated pavement defect detection often struggles to generalize across diverse real-world conditions due to the lack of standardized datasets. Existing datasets differ in annotation styles, distress type definitions, and formats, limiting their integration for unified training. To address this gap, we introduce a comprehensive benchmark dataset that consolidates multiple publicly available sources into a standardized collection of 52747 images from seven countries, with 135277 bounding box annotations covering 13 distinct distress types. The dataset captures broad real-world variation in image quality, resolution, viewing angles, and weather conditions, offering a unique resource for consistent training and evaluation. Its effectiveness was demonstrated through benchmarking with state-of-the-art object detection models including YOLOv8-YOLOv12, Faster R-CNN, and DETR, which achieved competitive performance across diverse scenarios. By standardizing class definitions and annotation formats, this dataset provides the first globally representative benchmark for pavement defect detection and enables fair comparison of models, including zero-shot transfer to new environments.

2512.08343 2026-06-11 cs.AI

Soil Compaction Parameters Prediction Based on Automated Machine Learning Approach

基于自动化机器学习方法的土壤压实参数预测

Caner Erden, Alparslan Serhat Demir, Abdullah Hulusi Kokcam, Talas Fikret Kurnaz, Ugur Dagdeviren

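Affiliations: Sakarya University of Applied Sciences, Faculty of Technology, Department of Computer Engineering

AI summary: Proposes an automated machine learning approach for predicting soil compaction parameters. Experiments find that XGBoost performs best across soil types, improving prediction accuracy and generalizability.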

Comments Presented at the 13th International Symposium on Intelligent Manufacturing and Service Systems, Duzce, Turkey, Sep 25-27, 2025. Also available on Zenodo: DOI 10.5281/zenodo.17533851

详情
Journal ref
Computers & Industrial Engineering, 2026
AI中文摘要

土壤压实在土木工程中至关重要,以确保路基和土坝等结构的稳定性。传统方法确定最优含水率(OMC)和最大干密度(MDD)需要大量实验室实验,经验回归模型在不同土壤类型中应用有限。近年来,人工智能(AI)和机器学习(ML)技术作为替代方法出现,但ML模型在预测准确性和泛化能力上仍有不足,尤其是面对异质数据集时。本文提出自动化机器学习(AutoML)方法来预测OMC和MDD。AutoML自动选择算法和超参数优化,可能提高准确性和可扩展性。通过广泛实验发现,极端梯度提升(XGBoost)算法在独立数据集上分别达到MDD的R-squared值80.4%和OMC的89.1%。这些结果展示了AutoML在不同土壤类型中预测压实参数的有效性。研究还强调了异质数据集在提高ML模型泛化能力和性能中的重要性。最终,本研究通过提升土壤压实参数的预测能力,为更高效和可靠的施工实践做出了贡献。

英文摘要

Soil compaction is critical in construction engineering to ensure the stability of structures like road embankments and earth dams. Traditional methods for determining optimum moisture content (OMC) and maximum dry density (MDD) involve labor-intensive laboratory experiments, and empirical regression models have limited applicability and accuracy across diverse soil types. In recent years, artificial intelligence (AI) and machine learning (ML) techniques have emerged as alternatives for predicting these compaction parameters. However, ML models often struggle with prediction accuracy and generalizability, particularly with heterogeneous datasets representing various soil types. This study proposes an automated machine learning (AutoML) approach to predict OMC and MDD. AutoML automates algorithm selection and hyperparameter optimization, potentially improving accuracy and scalability. Through extensive experimentation, the study found that the Extreme Gradient Boosting (XGBoost) algorithm provided the best performance, achieving R-squared values of 80.4% for MDD and 89.1% for OMC on a separate dataset. These results demonstrate the effectiveness of AutoML in predicting compaction parameters across different soil types. The study also highlights the importance of heterogeneous datasets in improving the generalization and performance of ML models. Ultimately, this research contributes to more efficient and reliable construction practices by enhancing the prediction of soil compaction parameters.
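The reported R-squared values follow the usual coefficient-of-determination definition: one minus the ratio of residual to total sum of squares. The sketch below is a minimal illustration with an assumed function name. It does not reproduce the study's AutoML pipeline or data.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)           # total variation
    return 1.0 - ss_res / ss_tot
```

A value of 0.804 (reported as 80.4% for MDD) means the model's residual sum of squares is 19.6% of the variation around the mean of the held-out measurements.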

2510.11290 2026-06-11 cs.AI cs.HC

Evolution in Simulation: AI-Agent School with Dual Memory for High-Fidelity Educational Dynamics

Sheng Jin, Haoming Wang, Zhiqi Gao, Yongbo Yang, Bao Chunjia, Chengliang Wang

Affiliations * Guanghua Law School, Zhejiang University; Faculty of Education, East China Normal University; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Department of Electrical and Computer Engineering, University of California San Diego; Institute of Systems Science, National University of Singapore

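AI summary This paper proposes the AI-Agent School system, which simulates complex educational dynamics through a self-evolving mechanism and uses a dual memory structure to enhance agents' cognitive abilities; experiments confirm its effectiveness in high-fidelity simulation.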

Comments 9 pages, 7 figures, EMNLP conference

Details
Journal ref
Findings of the Association for Computational Linguistics: EMNLP 2025
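AI abstract

Large language model (LLM)-based agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that uses agents to simulate complex educational dynamics. To address fragmented modeling of the teaching process and the limitations of agents in simulating diverse educational participants, AAS constructs the Zero-Exp strategy and employs a continuous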

English abstract

Large language model (LLM)-based agents are increasingly pivotal in simulating and understanding complex human systems and interactions. We propose the AI-Agent School (AAS) system, built around a self-evolving mechanism that leverages agents for simulating complex educational dynamics. Addressing the fragmentation in teaching process modeling and the limitations of agent performance in simulating diverse educational participants, AAS constructs the Zero-Exp strategy and employs a continuous "experience-reflection-optimization" cycle, grounded in a dual memory base that comprises experience and knowledge bases and incorporates short-term and long-term memory components. Through this mechanism, agents autonomously evolve via situated interactions within diverse simulated school scenarios. This evolution enables agents to more accurately model the nuanced, multi-faceted teacher-student engagements and underlying learning processes found in physical schools. Experiments confirm that AAS can effectively simulate intricate educational dynamics and foster advanced agent cognitive abilities, providing a foundational stepping stone from the "Era of Experience" to the "Era of Simulation" by generating high-fidelity behavioral and interaction data.
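The "experience-reflection-optimization" cycle over a dual memory base can be pictured with a toy sketch such as the one below. Everything here is an assumption made for illustration: the class name `DualMemoryAgent`, the method names, the three-item short-term window, and the string-based "reflection" are hypothetical and do not reproduce the AAS implementation, which relies on LLM agents.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class DualMemoryAgent:
    """Toy agent with short-term memory plus two long-term stores."""
    name: str
    # Short-term memory: only the most recent events are kept.
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Long-term experience base: every event the agent has lived through.
    experience_base: List[str] = field(default_factory=list)
    # Long-term knowledge base: lessons distilled by reflection.
    knowledge_base: List[str] = field(default_factory=list)

    def experience(self, event: str) -> None:
        self.short_term.append(event)
        self.experience_base.append(event)

    def reflect(self) -> str:
        # Stand-in for an LLM reflection step: summarise recent events.
        recent = "; ".join(self.short_term)
        return f"lesson from {len(self.short_term)} recent events: {recent}"

    def optimize(self, lesson: str) -> None:
        # Keep only lessons that are not already known.
        if lesson not in self.knowledge_base:
            self.knowledge_base.append(lesson)

    def step(self, event: str) -> None:
        """One pass through experience -> reflection -> optimization."""
        self.experience(event)
        self.optimize(self.reflect())


agent = DualMemoryAgent("teacher")
for event in ["asked a question", "student answered",
              "gave feedback", "assigned homework"]:
    agent.step(event)
```

In this sketch the short-term window forgets the oldest event once it is full, while the experience and knowledge bases keep growing, which is one simple way to separate short-term from long-term components.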

2510.06242 2026-06-11 cs.CL cs.AI

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer

Affiliations * Kookmin University; Sungkyunkwan University; Nexxt Intelligence

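AI summary This paper proposes a two-stage evaluation framework for human open-ended survey responses that removes nonsensical responses and then assesses effort, relevance, and completeness, improving automatic evaluation.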

Comments EMNLP Industry Track

Details
Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
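AI abstract

Open-ended survey responses provide valuable insights for marketing research, but low-quality responses not only burden researchers with manual filtering but can also lead to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and cannot adequately assess human responses, which have distinct characteristics. To address these characteristics, we propose a two-stage evaluation framework designed specifically for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also shows high practical utility in real-world applications such as response quality prediction and response rejection, with strong correlation to expert assessment.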

English abstract

Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.
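The two-stage flow described in this abstract (filter first, then score three dimensions) could be wired together roughly as in the sketch below. This is only an illustration under stated assumptions: the vowel-ratio heuristic in `looks_like_gibberish` is a toy stand-in that works for English text only, and `dummy_scorer` replaces the LLM-based scoring with a word-count placeholder. Neither is the paper's method, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

DIMENSIONS = ("effort", "relevance", "completeness")


@dataclass
class Evaluation:
    response: str
    is_gibberish: bool
    scores: Optional[Dict[str, float]]  # None when removed in stage 1


def looks_like_gibberish(text: str) -> bool:
    """Toy stage-1 filter (assumption): too few letters or too few vowels."""
    letters = [c for c in text.lower() if c.isalpha()]
    if len(letters) < 3:
        return True
    vowel_ratio = sum(c in "aeiou" for c in letters) / len(letters)
    return vowel_ratio < 0.2


def evaluate(responses: List[str],
             scorer: Callable[[str], Dict[str, float]]) -> List[Evaluation]:
    """Stage 1 drops nonsense; stage 2 scores whatever survives."""
    results = []
    for response in responses:
        if looks_like_gibberish(response):
            results.append(Evaluation(response, True, None))
        else:
            results.append(Evaluation(response, False, scorer(response)))
    return results


def dummy_scorer(response: str) -> Dict[str, float]:
    """Placeholder for an LLM call: scores every dimension by word count."""
    value = min(len(response.split()) / 10.0, 1.0)
    return {dimension: value for dimension in DIMENSIONS}


results = evaluate(
    ["asdfgh jkl qwrt",
     "The app is easy to use but crashes when I upload photos."],
    dummy_scorer,
)
```

Passing the scorer in as a callable keeps the filtering stage independent of whichever model performs the second-stage assessment.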