arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.17269 2026-05-21 cs.CV cs.AI

FineVision: Open Data Is All You Need

Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti

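Affiliations Hugging Face; Technical University of Munich; Stanford University

AI summary This paper presents FineVision, a high-quality dataset of 24 million samples that unifies more than 200 sources through a semi-automated pipeline, with rigorous data cleaning and human review to ensure quality. Models trained on it perform better across a broad evaluation, advancing data-centric vision-language model research.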

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.

2510.14444 2026-05-21 cs.LG cs.AI

A Free Lunch in LLM Compression: Revisiting Retraining after Pruning

Moritz Wagner, Christophe Roux, Max Zimmer, Sebastian Pokutta

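Affiliations Department for AI in Society, Science, and Technology, Zuse Institute Berlin; Institute of Mathematics, Technische Universität Berlin

AI summary This paper studies post-pruning adaptation via local reconstruction, finds that it effectively improves model quality while reducing data and compute costs, shows how the granularity of the reconstructed parameter window affects final quality, and challenges the prevailing view that post-pruning adaptation is impractical for LLMs.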

Abstract

Post-training pruning can substantially reduce LLM inference costs, but it often degrades quality unless the remaining weights are adapted. Since global retraining is expensive at LLM scale, recent work has largely focused on increasingly sophisticated pruning criteria that aim to select better sparsity patterns without adaptation. We revisit this trade-off through local reconstruction: after pruning, we adapt one subset of the model parameters at a time on a calibration set, training it to match the corresponding intermediate activations of the dense model. We evaluate local reconstruction across model families and scales, up to 72B parameters, and establish three main findings. First, local reconstruction is an effective adaptation mechanism for LLMs: it matches post-pruning retraining while using over an order of magnitude less data and compute, even when using PEFT techniques. Second, reconstruction exhibits a broad "free-lunch" regime in granularity, i.e., the reconstruction parameter window: as long as the reconstructed region contains at least a nonlinear submodule, final quality is largely insensitive to the window size, allowing granularity to be chosen primarily based on memory constraints. In contrast, reconstructing individual matrices, despite being the natural approach often proposed in the literature, consistently underperforms, as small matrix-level errors accumulate into larger activation drift. Lastly, reconstruction reduces the relative importance of the pruning criterion: performance gaps between sophisticated criteria and simple baselines shrink with model scale, making simple methods competitive again. Overall, our results challenge the prevailing view that post-pruning adaptation is impractical for LLMs.
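The reconstruction objective described above can be illustrated on a toy scale. The sketch below is not the authors' implementation: it adapts the surviving weights of a single pruned output row by gradient descent so that its outputs on a small calibration set match those of the dense row. This is the matrix-level variant, which the abstract reports as the weaker choice, and it is used here only because it fits in a few lines. The names `reconstruct_row` and `mse`, the learning rate, and the data are all hypothetical.

```python
def reconstruct_row(dense_row, mask, inputs, steps=500, lr=0.05):
    """Fit the unpruned weights of one output row to reproduce the dense row's outputs."""
    weights = [w * m for w, m in zip(dense_row, mask)]  # start from the pruned row
    targets = [sum(w * x for w, x in zip(dense_row, xs)) for xs in inputs]
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for xs, target in zip(inputs, targets):
            error = sum(w * x for w, x in zip(weights, xs)) - target
            for k, x in enumerate(xs):
                grads[k] += 2.0 * error * x / len(inputs)
        # Pruned positions (mask == 0) stay at zero; only surviving weights move.
        weights = [w - lr * g * m for w, g, m in zip(weights, grads, mask)]
    return weights


def mse(row, dense_row, inputs):
    """Mean squared difference between the outputs of row and dense_row."""
    return sum(
        (sum(w * x for w, x in zip(row, xs)) - sum(w * x for w, x in zip(dense_row, xs))) ** 2
        for xs in inputs
    ) / len(inputs)


dense_row = [1.0, 0.5, -0.3]
mask = [1, 1, 0]  # third weight pruned
calibration = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.5], [2.0, 1.0, 1.0]]
pruned = [w * m for w, m in zip(dense_row, mask)]
adapted = reconstruct_row(dense_row, mask, calibration)
```

In the setting the abstract recommends, the reconstructed region would contain at least a nonlinear submodule rather than a single matrix; the loss has the same form, with intermediate activations of the dense model as targets.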

2510.09833 2026-05-21 cs.CV

Post Processing of image segmentation using Conditional Random Fields

Aashish Dhawan, Pankaj Bodani, Vishal Garg

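Affiliations Dept. of Computer Science & Engineering; Space Applications Center; JMIETI; ISRO

AI summary This paper studies how Conditional Random Fields can improve the clarity of image segmentation results, analyzing how different CRF types perform on low-quality satellite imagery and high-quality aerial photographs and assessing the strengths and weaknesses of each approach.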

Journal ref Proc. 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 147-151, 2019

Abstract

The output of the image segmentation process is usually not very clear due to the low-quality features of satellite images. The purpose of this study is to find a suitable Conditional Random Field (CRF) to achieve better clarity in a segmented image. We started with different types of CRFs and studied why they are or are not suitable for our purpose. We evaluated our approach on two different datasets: satellite imagery with low-quality features and high-quality aerial photographs. During the study we experimented with various CRFs to find which gives the best results on these images, and compared our results across the two datasets to show the pitfalls and potential of different approaches.

2510.08482 2026-05-21 cs.CV cs.CL

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Onur Keleş, Aslı Özyürek, Gerardo Ortega, Kadir Gökgöz, Esam Ghaleb

Affiliations Multimodal Language Department; Max Planck Institute for Psycholinguistics; Department of Linguistics; Boğaziçi University; Donders Institute for Brain Cognition and Behaviour; Radboud University; Department of Linguistics and Communication; University of Birmingham

AI summary This paper introduces a novel video benchmark for evaluating vision-language models on sign language form-meaning mapping, using psycholinguistic measures across three tasks: phonological sign-form prediction, transparency, and graded iconicity ratings. Models recover some phonological form detail but remain below human performance overall.

Abstract

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

2510.06824 2026-05-21 cs.LG

Efficient numeracy in language models through single-token number embeddings

Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

Affiliations Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital, Munich, Germany; Department of Computing, Imperial College London, UK; Munich Center for Machine Learning (MCML), Munich, Germany; Hasso Plattner Institute for Digital Engineering, University of Potsdam, Germany

AI summary This paper proposes BitTokens, a method that encodes numbers as single tokens using their IEEE 754 binary floating-point representation, allowing language models to handle numerical computation more efficiently and thereby improving their ability to solve complex problems.

Abstract

To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either weakening the numerical representations of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.
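As a rough illustration of the representation BitTokens builds on, the sketch below unpacks a Python float into its IEEE 754 bits and back. It shows only the standard bit layout (1 sign bit, 11 exponent bits, 52 fraction bits for double precision). The choice of double precision is an assumption made for illustration, the function names `float_to_bits` and `bits_to_float` are hypothetical, and the abstract does not describe how the bits are turned into a token embedding.

```python
import struct


def float_to_bits(x: float) -> list[int]:
    """Return the 64 IEEE 754 double-precision bits of x, most significant first."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> shift) & 1 for shift in range(63, -1, -1)]


def bits_to_float(bits: list[int]) -> float:
    """Inverse of float_to_bits: rebuild the float from its 64 bits."""
    as_int = 0
    for bit in bits:
        as_int = (as_int << 1) | bit
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


bits = float_to_bits(-6.5)
sign, exponent, fraction = bits[0], bits[1:12], bits[12:]
```

Because every finite double maps to exactly 64 bits, a number of any magnitude occupies a fixed-size vector, which is what makes a single-token encoding possible in principle.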

2510.00520 2026-05-21 cs.CV

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

Darya Taratynova, Ahmed Aly, Numan Saeed, Mohammad Yaqub

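Affiliations Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper introduces CardioBench, a benchmark for evaluating echocardiography foundation models that unifies multiple public datasets and assesses models under zero-shot, probing, and alignment protocols. It finds that general-purpose encoders transfer well but fall short on fine-grained distinctions.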

Abstract

Foundation models are reshaping medical imaging, yet their application in echocardiography remains limited, hindered by a heavy reliance on private datasets that prevent reproducible comparison. Echocardiography poses unique challenges, including noisy acquisitions, high frame redundancy, and limited diverse public datasets. To address this, we introduce CardioBench, a comprehensive benchmark for echocardiography foundation models. Specifically, CardioBench unifies eight publicly available datasets into a standardized suite spanning four regression and five classification tasks, covering functional, structural, diagnostic, and view recognition endpoints. Leveraging this framework, we evaluate several leading foundation models, including cardiac-specific, biomedical, and general-purpose encoders, under consistent zero-shot, probing, and alignment protocols. Our analysis reveals that while general-purpose encoders transfer well and often close the gap with probing, they struggle significantly with fine-grained distinctions like view classification and subtle pathology recognition. Results indicate that models capturing temporal cardiac dynamics perform best on functional tasks, while retrieval-based approaches generalize more consistently across datasets. By releasing preprocessing, splits, and public evaluation pipelines, CardioBench establishes a reproducible reference point to guide the architectural design of future echocardiography and possibly other medical imaging foundation models.

2509.26627 2026-05-21 cs.AI cs.LG cs.RO

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

Yuyang Liu, Chuan Wen, Yihang Hu, Dinesh Jayaraman, Yang Gao

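Affiliations Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China; Shanghai Qi Zhi Institute; Shanghai Jiao Tong University; University of Pennsylvania

AI summary This paper proposes TimeRewarder, which learns dense rewards from passive videos via frame-wise temporal distance to improve reinforcement learning on sparse-reward tasks. Experiments show markedly higher success rates and sample efficiency across multiple tasks.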

Comments ICML 2026 spotlight paper

Abstract

Designing dense rewards is crucial for reinforcement learning (RL), yet in robotics it often demands extensive manual effort and lacks scalability. One promising solution is to view task progress as a dense reward signal, as it quantifies the degree to which actions advance the system toward task completion over time. We present TimeRewarder, a simple yet effective reward learning method that derives progress estimation signals from passive videos, including robot demonstrations and human videos, by modeling temporal distances between frame pairs. We then demonstrate how TimeRewarder can supply step-wise proxy rewards to guide reinforcement learning. In our comprehensive experiments on ten challenging Meta-World tasks, we show that TimeRewarder dramatically improves RL for sparse-reward tasks, achieving nearly perfect success in 9/10 tasks with only 200,000 environment interactions per task. This approach outperformed previous methods and even the manually designed environment dense reward on both the final success rate and sample efficiency. Moreover, we show that TimeRewarder pretraining can exploit real-world human videos, highlighting its potential as a scalable approach to rich reward signals from diverse video sources.

2509.25606 2026-05-21 cs.LG

Effective Model Pruning: Measure The Redundancy of Model Components

Yixuan Wang, Dan P. Guralnik, Saiedeh Akbari, Warren E. Dixon

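Affiliations Department of Mechanical and Aerospace Engineering, University of Florida; Department of Mathematics, Ohio University

AI summary This paper studies a basic question in model pruning and proposes a pruning method based on effective sample size, which determines how many components can be discarded by analyzing the distribution of importance scores, and validates it on a variety of network architectures.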

Comments 18 pages, 4 figures. Accepted at ICML 2026 (Spotlight)

Abstract

This article initiates the study of a basic question about model pruning. Given a vector $s$ of importance scores assigned to model components, how many of the scored components could be discarded without sacrificing performance? We propose Effective Model Pruning (EMP), which derives the desired sparsity directly from the score distribution using the notion of effective sample size from particle filtering, also known as the inverse Simpson index. Rather than prescribe a pruning criterion, EMP supplies a universal adaptive threshold derived from the distribution of the score $s$ over the model components: EMP maps $s$ to a number $N_{eff}=N_{eff}(s)$, called the effective sample size. The $N-N_{eff}$ lowest scoring components are discarded. A tight lower bound on the effective mass $s_{eff}$ (the sum of retained normalized scores) in terms of $N_{eff}$ is derived. This process yields models with a provable upper bound on the loss change relative to the original dense model. Numerical experiments are performed demonstrating this phenomenon across a variety of network architectures including MLPs, CNNs, Transformers, LLMs, and KAN. It is also shown that EMP addresses a rich set of pruning criteria such as weight magnitude, attention score, KAN importance score, and even feature-level signals such as image pixels.
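The thresholding rule can be sketched in a few lines, assuming the standard definition of effective sample size, $N_{eff}(s) = (\sum_i s_i)^2 / \sum_i s_i^2$, which equals the inverse Simpson index $1/\sum_i p_i^2$ of the normalized scores $p_i = s_i / \sum_j s_j$. The abstract does not say how a non-integer $N_{eff}$ is rounded, so rounding up is an assumption here, as are the function names and the requirement that scores are non-negative with a positive sum.

```python
import math


def effective_sample_size(scores):
    """Inverse Simpson index of the normalized scores: 1 / sum(p_i ** 2)."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return 1.0 / sum(p * p for p in probs)


def emp_keep_indices(scores):
    """Indices kept after discarding the N - N_eff lowest-scoring components."""
    # Rounding up (and capping at N) is an assumption made for this sketch.
    n_keep = min(len(scores), math.ceil(effective_sample_size(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


scores = [4.0, 4.0, 1.0, 1.0]
n_eff = effective_sample_size(scores)  # about 2.94
kept = emp_keep_indices(scores)        # [0, 1, 2]
```

When all scores are equal, $N_{eff} = N$ and nothing is pruned; the more the score mass concentrates on a few components, the smaller $N_{eff}$ becomes and the more components are discarded.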

2509.22963 2026-05-21 cs.LG

Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces

Haitong Ma, Ofir Nabati, Aviv Rosenberg, Bo Dai, Oran Lang, Craig Boutilier, Na Li, Shie Mannor, Lior Shani, Guy Tenneholtz

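Affiliations Google Research; Harvard University; Google DeepMind; Nvidia Research

AI summary This paper proposes a new framework for training discrete diffusion models as effective policies in complex combinatorial action spaces. An efficient online training process based on policy mirror descent yields stable policy improvement and state-of-the-art results on several challenging combinatorial benchmarks.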

Comments 22 pages, 10 figures. Haitong Ma and Ofir Nabati contributed equally to this paper

Abstract

Reinforcement learning (RL) struggles to scale to large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in these complex settings. Our key innovation is an efficient online training process that ensures stable and effective policy improvement. By leveraging policy mirror descent (PMD) to define an ideal, regularized target policy distribution, we frame the policy update as a distributional matching problem, training the expressive diffusion model to replicate this stable target. This decoupled approach stabilizes learning and significantly enhances training performance. Our method achieves state-of-the-art results and superior sample efficiency across a diverse set of challenging combinatorial benchmarks, including DNA sequence generation, RL with macro-actions, and multi-agent systems. Experiments demonstrate that our diffusion policies attain superior performance compared to other baselines.
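For intuition about the regularized target, the sketch below computes the textbook KL-regularized policy mirror descent update, $\pi_{\text{target}}(a) \propto \pi_{\text{old}}(a)\exp(Q(a)/\lambda)$, on a tiny action set that can be enumerated. This is an assumption about the general form only: the abstract does not give the exact regularizer, and in a combinatorial action space the target cannot be enumerated like this, which is why a diffusion model is trained to match it. The name `pmd_target` and the numbers are hypothetical.

```python
import math


def pmd_target(old_policy, q_values, temperature):
    """KL-regularized mirror descent target, proportional to old_policy * exp(Q / temperature)."""
    max_q = max(q_values)  # subtract the max before exponentiating for numerical stability
    weights = [
        p * math.exp((q - max_q) / temperature)
        for p, q in zip(old_policy, q_values)
    ]
    norm = sum(weights)
    return [w / norm for w in weights]


old_policy = [0.25, 0.25, 0.25, 0.25]
q_values = [1.0, 0.0, 0.0, -1.0]
target = pmd_target(old_policy, q_values, temperature=1.0)
```

A larger temperature keeps the target closer to the old policy, while a smaller one moves it toward the greedy action.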

2509.17931 2026-05-21 cs.CV physics.med-ph

Multi-needle Localization for Pelvic Seed Implant Brachytherapy based on Tip-handle Detection and Matching

Zhuo Xiao, Fugen Zhou, Jingjing Wang, Chongyu He, Bo Liu, Haitao Sun, Zhe Ji, Yuliang Jiang, Junjie Wang, Qiuwen Wu

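Affiliations Image Processing Center, Beihang University; Department of Radiation Oncology, Peking University Third Hospital; Department of Radiation Oncology, Duke University Medical Center

AI summary This paper proposes a new method based on tip-handle detection and matching for multi-needle localization in intraoperative CT images. Using an anchor-free network and a greedy matching and merging method, it achieves higher precision and F1 score on a dataset of 100 patients, offering a more robust and accurate solution for needle localization in complex clinical scenarios.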

Abstract

Accurate multi-needle localization in intraoperative CT images is crucial for optimizing seed placement in pelvic seed implant brachytherapy. However, this task is challenging due to poor image contrast and needle adhesion. This paper presents a novel approach that reframes needle localization as a tip-handle detection and matching problem to overcome these difficulties. An anchor-free network, based on HRNet, is proposed to extract multi-scale features and accurately detect needle tips and handles by predicting their centers and orientations using decoupled branches for heatmap regression and polar angle prediction. To associate detected tips and handles into individual needles, a greedy matching and merging (GMM) method designed to solve the unbalanced assignment problem with constraints (UAP-C) is presented. The GMM method iteratively selects the most probable tip-handle pairs and merges them based on a distance metric to reconstruct 3D needle paths. Evaluated on a dataset of 100 patients, the proposed method demonstrates superior performance, achieving higher precision and F1 score compared to a segmentation-based method utilizing the nnUNet model, thereby offering a more robust and accurate solution for needle localization in complex clinical scenarios.
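The greedy selection step can be sketched generically. The code below is not the paper's GMM method: it only shows greedy assignment on an unbalanced cost matrix, where a lower cost stands in for a more probable tip-handle pair and `None` marks a pair ruled out by a constraint. How the costs are derived from the detected centers and orientations, and the subsequent merging step, are not covered. The name `greedy_match` and the numbers are hypothetical.

```python
def greedy_match(cost):
    """Greedy assignment on a tips x handles cost matrix; None marks a forbidden pair."""
    candidates = sorted(
        (c, i, j)
        for i, row in enumerate(cost)
        for j, c in enumerate(row)
        if c is not None
    )
    used_tips, used_handles, pairs = set(), set(), []
    for _, i, j in candidates:
        # Take the cheapest remaining pair whose tip and handle are both still free.
        if i not in used_tips and j not in used_handles:
            pairs.append((i, j))
            used_tips.add(i)
            used_handles.add(j)
    return pairs


# Three tips and two handles (unbalanced): one tip is left unmatched.
cost = [
    [0.2, 0.9],
    [0.8, 0.1],
    [0.3, None],
]
pairs = greedy_match(cost)  # [(1, 1), (0, 0)]
```

Greedy selection is not guaranteed to minimize the total cost, but it handles unequal numbers of tips and handles without padding and makes it easy to apply per-pair constraints.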

2509.14165 2026-05-21 cs.CV cs.AI

Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

Michal Szczepanski, Martyna Poreba, Karim Haroun

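Affiliations Université Paris-Saclay, CEA, List; I3S, Université Côte d’Azur, CNRS

AI summary This paper proposes the STEP framework, which improves efficiency through dynamic patch merging and token pruning, substantially reducing computational cost and raising throughput on high-resolution semantic segmentation while maintaining high accuracy.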

Journal ref SN Computer Science 2026

Abstract

Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
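The token arithmetic behind these figures can be checked directly. The sketch below assumes non-overlapping square patches and simply applies the reported 2.5x factor to the baseline count; `patch_token_count` is a hypothetical helper, not part of STEP.

```python
def patch_token_count(height, width, patch):
    """Number of non-overlapping patch x patch tokens in a height x width image."""
    return (height // patch) * (width // patch)


baseline = patch_token_count(1024, 1024, 16)  # 4096 tokens with 16 x 16 patches
after_dcts = baseline / 2.5                   # about 1638 tokens after dCTS merging
```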

2509.13648 2026-05-21 cs.LG cs.IR

Sequential Data Augmentation for Generative Recommendation

Geon Lee, Bhuvesh Kumar, Clark Mingxuan Ju, Tong Zhao, Kijung Shin, Neil Shah, Liam Collins

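Affiliations Snap Inc.

AI summary This paper studies the effect of data augmentation in generative recommendation and proposes GenPAS, a systematic framework that unifies a range of augmentation strategies through three bias-controlled steps, improving accuracy, data efficiency, and parameter efficiency.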

Abstract

Generative recommendation plays a crucial role in personalized systems, predicting users' future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation. Our code is available at https://github.com/snap-research/GenPAS.
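The three-step sampling view can be made concrete with a toy sampler. Everything specific below is hypothetical: the abstract names the three steps but not how their biases are parameterized, so the exponents `alpha`, `beta`, and `gamma` and the power-law weights are assumptions chosen only to show one bias knob per step. The released GenPAS code may differ.

```python
import random


def sample_pair(sequences, alpha, beta, gamma, rng):
    """Draw one (input, target) training pair in three biased steps."""
    # Step 1, sequence sampling: longer histories are favoured when alpha > 0.
    seq = rng.choices(sequences, weights=[len(s) ** alpha for s in sequences])[0]
    # Step 2, target sampling: later positions are favoured when beta > 0.
    positions = list(range(1, len(seq)))
    t = rng.choices(positions, weights=[p ** beta for p in positions])[0]
    # Step 3, input sampling: longer input windows are favoured when gamma > 0.
    starts = list(range(t))
    start = rng.choices(starts, weights=[(t - s) ** gamma for s in starts])[0]
    return seq[start:t], seq[t]


rng = random.Random(0)
histories = [["a", "b", "c", "d"], ["x", "y"]]
inputs, target = sample_pair(histories, alpha=1.0, beta=1.0, gamma=1.0, rng=rng)
```

Setting an exponent to zero makes the corresponding step uniform, so fixed strategies can be recovered as special cases of the sampler by choosing the three biases.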

2509.13482 2026-05-21 cs.CV

Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

Hao Xu, Xiaolin Wu, Xi Zhang

Affiliations Department of Electrical & Computer Engineering, McMaster University; School of Computing and Artificial Intelligence, Southwest Jiaotong University; School of Computer Science and Technology, Tongji University

AI summary This paper proposes scene-adaptive lattice vector quantization (SALVQ) to improve 3D Gaussian Splatting (3DGS) compression, optimizing the lattice basis per scene to improve adaptability and R-D efficiency while reducing computational overhead and training time.

Comments Accepted by IEEE TIP. Code available at https://github.com/hxu160/SALVQ

Abstract

3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ's adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
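To make the quantizer concrete, the sketch below quantizes a 2D point to a lattice by rounding its coordinates in the lattice basis (often called Babai rounding, an approximation to the nearest lattice point), and shows how scaling the basis coarsens the lattice. The fixed hexagonal basis, the 2D setting, and the name `lattice_quantize` are assumptions for illustration; in SALVQ the basis is optimized per scene, and the abstract does not specify the dimension or the search procedure.

```python
def lattice_quantize(x, basis, scale=1.0):
    """Quantize 2D point x to the lattice spanned by the columns of scale * basis.

    Rounds the coordinates of x in the lattice basis (Babai rounding), which
    approximates the nearest lattice point.
    """
    (a, b), (c, d) = basis
    a, b, c, d = a * scale, b * scale, c * scale, d * scale
    det = a * d - b * c
    # Coordinates of x in the lattice basis: B^{-1} x.
    u = (d * x[0] - b * x[1]) / det
    v = (-c * x[0] + a * x[1]) / det
    ku, kv = round(u), round(v)  # integer indices, the symbols that would be coded
    return (ku, kv), (a * ku + b * kv, c * ku + d * kv)


hex_basis = ((1.0, 0.5), (0.0, 0.75 ** 0.5))  # hexagonal lattice, illustration only
index, point = lattice_quantize((2.2, 1.6), hex_basis)
coarse_index, coarse_point = lattice_quantize((2.2, 1.6), hex_basis, scale=2.0)
```

With a diagonal basis this reduces to uniform scalar quantization, and a larger `scale` gives a sparser lattice and therefore a lower bit rate, which is how a single model can serve several rate targets.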

2509.09946 2026-05-21 cs.CV

Online 3D Multi-Camera Perception through Robust 2D Tracking and Depth-based Late Aggregation

Vu-Minh Le, Thao-Anh Tran, Duc Huy Do, Xuan Canh Do, Huong Ninh, Hai Tran

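Affiliations Optoelectronics Center, Viettel Aerospace Institute, Viettel Group; University of Engineering and Technology, Vietnam National University; School of Electrical and Electronic Engineering, Hanoi University of Science and Technology

AI summary This paper presents a method that uses depth information to extend an existing online 2D multi-camera tracking system into 3D space, reconstructing targets in point-cloud space and recovering their 3D boxes through clustering and yaw refinement. It also introduces an enhanced online data association mechanism that uses local ID consistency to assign global IDs across frames. Evaluated on the 2025 AI City Challenge 3D MTMC dataset, the framework placed third.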

Comments Accepted at ICCVW 2025

Abstract

Multi-Target Multi-Camera Tracking (MTMC) is an essential computer vision task for automating large-scale surveillance. With camera calibration and depth information, the targets in the scene can be projected into 3D space, offering unparalleled levels of automatic perception of a 3D environment. However, tracking in the 3D space requires replacing all 2D tracking components from the ground up, which may be infeasible for existing MTMC systems. In this paper, we present an approach for extending any online 2D multi-camera tracking system into 3D space by utilizing depth information to reconstruct a target in point-cloud space, and recovering its 3D box through clustering and yaw refinement following tracking. We also introduce an enhanced online data association mechanism that leverages the target's local ID consistency to assign global IDs across frames. The proposed framework is evaluated on the 2025 AI City Challenge's 3D MTMC dataset, achieving 3rd place on the leaderboard.
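The first step, lifting a tracked 2D target into point-cloud space, can be sketched with the standard pinhole camera model. The code below back-projects the pixels inside a 2D box into the camera frame using a depth map and intrinsics. The function names, the toy depth map, and the convention that zero means missing depth are assumptions; the transform to world coordinates via extrinsics and the paper's clustering and yaw refinement are not shown.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D camera-frame point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def box_to_points(box, depth_map, fx, fy, cx, cy):
    """Back-project every pixel inside a 2D box (x0, y0, x1, y1) that has valid depth."""
    x0, y0, x1, y1 = box
    points = []
    for v in range(y0, y1):
        for u in range(x0, x1):
            d = depth_map[v][u]
            if d > 0:  # zero marks missing depth in this toy example
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points


depth_map = [[0.0, 2.0, 2.0], [2.0, 2.0, 2.0], [2.0, 2.0, 4.0]]
cloud = box_to_points((0, 0, 3, 3), depth_map, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
```

In practice the points inside a box also include background pixels, which is one reason a clustering step is needed before a 3D box is fitted.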

2509.09215 2026-05-21 cs.AI cs.CR

Enabling Regulatory Multi-Agent Collaboration: Architecture, Challenges, and Solutions

Qinnan Hu, Yuntao Wang, Yuan Gao, Zhou Su, Linkang Du, Qichao Xu

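Affiliations School of Cyber Science and Engineering; Xi'an Jiaotong University; School of Mechatronic Engineering and Automation; Shanghai University

AI summary This paper proposes a blockchain-enabled layered architecture for regulatory agent collaboration, with three key modules for automated accountability, dynamic reputation evaluation, and malicious behavior forecasting, to establish trustworthy, resilient, and scalable regulatory mechanisms.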

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

大型语言模型(LLMs)赋能的自主智能体正在通过实现适应性、多智能体协作改变数字和物理环境。尽管这些智能体在金融、医疗和智能制造等领域提供了显著机会,但其不可预测的行为和异构能力带来了重大治理和问责挑战。在本文中,我们提出了一种基于区块链的分层架构,用于监管智能体协作,包括智能体层、区块链数据层和监管应用层。在此框架内,我们设计了三个关键模块:(i)智能体行为追踪和仲裁模块用于自动化问责,(ii)动态声誉评估模块用于协作场景中的信任评估,(iii)恶意行为预测模块用于早期检测对抗性活动。我们的方法建立了在大规模智能体生态系统中可信、健壮和可扩展的监管机制的系统基础。最后,我们讨论了区块链赋能的监管框架在多智能体系统中的未来研究方向。

英文摘要

Autonomous agents empowered by large language models (LLMs) are transforming both digital and physical environments by enabling adaptive, multi-agent collaboration. While these agents offer significant opportunities across domains such as finance, healthcare, and smart manufacturing, their unpredictable behaviors and heterogeneous capabilities pose substantial governance and accountability challenges. In this paper, we propose a blockchain-enabled layered architecture for regulatory agent collaboration, comprising an agent layer, a blockchain data layer, and a regulatory application layer. Within this framework, we design three key modules: (i) an agent behavior tracing and arbitration module for automated accountability, (ii) a dynamic reputation evaluation module for trust assessment in collaborative scenarios, and (iii) a malicious behavior forecasting module for early detection of adversarial activities. Our approach establishes a systematic foundation for trustworthy, resilient, and scalable regulatory mechanisms in large-scale agent ecosystems. Finally, we discuss future research directions for blockchain-enabled regulatory frameworks in multi-agent systems.

2509.07674 2026-05-21 cs.RO cs.HC

Temporal Counterfactual Explanations of Behaviour Tree Decisions

行为树决策的时序反事实解释

Tamlin Love, Antonio Andriella, Guillem Alenyà

发表机构 * Institut de Robòtica i Informàtica Industrial, CSIC-UPC(机器人与计算机工业研究所,CSIC-UPC)

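AI summary: This paper presents a method that generates counterfactual explanations of robot decisions by building a causal model of the behaviour tree, a step towards more transparent and safer robotic systems.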

Comments 33 pages, 7 figures + 4 figures in appendices

详情
AI中文摘要

可解释性,特别是机器人能够解释其决策或行为原因的能力,是帮助用户理解其交互和共存的机器人的重要工具。行为树是控制机器人决策的流行框架,因此一个自然的问题是,由行为树驱动的系统是否能够回答'为什么'的问题。尽管行为树驱动的机器人可解释性已受到一些关注,但现有的方法无法生成详细说明机器人决策原因的因果反事实解释。因此,在本工作中,我们介绍了一种新颖的方法,该方法能够自动根据对比性'为什么'问题生成反事实解释。我们的方法通过首先自动构建从行为树结构以及状态和个体行为树节点的领域知识中的因果模型来实现这一点。然后对所得因果模型进行查询和搜索,以找到一组多样的反事实解释。我们证明我们的方法能够正确解释广泛的行为树结构和状态在实时中的行为,与之前的方法相比,这些方法要么无法用因果解释回答对比性问题,要么无法保证提供一致和准确的解释。通过能够回答广泛的因果查询,我们的方法代表了朝着更透明、更易理解和最终更安全和可信的机器人系统迈进的一步。

英文摘要

Explainability, in particular, the ability for robots to explain why they have made a decision or behaved in a certain way, is a critical tool in helping users understand the robots they interact and coexist with. Behaviour trees are a popular framework for controlling the decision-making of robots, and thus a natural question to ask is whether or not a system driven by a behaviour tree is capable of answering "why" questions. While explainability for behaviour tree-driven robots has seen some prior attention, no existing methods are capable of generating causal, counterfactual explanations which detail the reasons for robot decisions and behaviour. Therefore, in this work, we introduce a novel approach which automatically generates counterfactual explanations in response to contrastive "why" questions. Our method achieves this by first automatically building a causal model from the structure of the behaviour tree as well as domain knowledge about the state and individual behaviour tree nodes. The resultant causal model is then queried and searched to find a set of diverse counterfactual explanations. We demonstrate that our approach is able to correctly explain the behaviour of a wide range of behaviour tree structures and states in real time, unlike previous methods which are either unable to answer contrastive questions with causal explanations, or are not guaranteed to provide consistent and accurate explanations. By being able to answer a wide range of causal queries, our approach represents a step towards more transparent, understandable, and ultimately safe and trustworthy robotic systems.

2509.03526 2026-05-21 cs.CL eess.AS

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

通过强化行为对齐增强语音大语言模型

Yansong Liu, Jiateng Li, Yuan Liu

发表机构 * Xi’an Jiaotong-Liverpool University(西交利物浦大学) The Chinese University of Hong Kong(香港中文大学)

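AI summary: This paper proposes Reinforced Behavior Alignment (RBA), a framework that improves the language generation ability of SpeechLMs using high-quality alignment data produced by self-synthesis. Experiments show that it improves instruction following and extends to spoken question answering and speech-to-text translation.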

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进展引发了对扩展其语言能力至其他模态的广泛关注,从而催生了能够处理语音或文本格式用户请求的语音基础语言模型(SpeechLMs)。然而,由于跨模态差异,这些SpeechLMs在指令遵循方面仍显著逊色于文本基础的LLMs,特别是在面对用户语音的动态和多变性质时。为了解决这一挑战,本文提出了一种名为强化行为对齐(RBA)的框架,旨在增强SpeechLMs的语言生成能力。RBA不依赖于监督微调的人类标注,而是通过强大的教师LLM生成大量高质量的对齐数据。然后,通过强化学习方法将SpeechLMs的行为与教师模型的行为对齐。实验结果表明,该方法有效提升了SpeechLMs的指令遵循能力,使其优于传统蒸馏基线。关键的是,我们证明RBA可以无缝扩展到包括语音问答和语音到文本翻译在内的任务,在仅使用自生成数据的情况下,在开放基准上取得了最先进的性能。

英文摘要

Recent advancements in Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which has led to the emergence of speech-based LLMs (SpeechLMs) capable of processing user requests in either speech or text form. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology in which a powerful teacher LLM generates extensive, high-fidelity alignment data. The SpeechLM's behavior is then aligned with that of the teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs, which outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

2508.11354 2026-05-21 cs.CV cs.AI cs.LG

FunduSegmenter: Leveraging the RETFound Foundation Model for Joint Optic Disc and Optic Cup Segmentation in Retinal Fundus Images

FunduSegmenter:利用RETFound基础模型进行视网膜眼底图像中视盘和视杯的联合分割

Zhenyi Zhao, Muthu Rama Krishnan Mookiah, Emanuele Trucco

发表机构 * University of Dundee(邓迪大学)

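AI summary: This paper proposes FunduSegmenter, a model built on the RETFound foundation model that adds a series of novel modules for joint optic disc and optic cup segmentation. In experiments it generally outperformed state-of-the-art baselines across several datasets.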

Journal ref Trans. Vis. Sci. Tech. 2026;15(5):14

详情
AI中文摘要

目的:本研究首次将RETFound模型应用于视盘(OD)和视杯(OC)的联合分割。RETFound是一个为眼底相机和光学相干断层扫描图像开发的知名基础模型,已在疾病诊断中表现出色。方法:我们提出FunduSegmenter,该模型整合了一系列新颖模块与RETFound,包括预适配器、解码器、后适配器、带有卷积块注意模块的跳跃连接以及视觉Transformer块适配器。该模型在自有数据集GoDARTS以及四个公开数据集IDRiD、Drishti-GS、RIM-ONE-r3和REFUGE上进行了评估,通过内部验证、外部验证和领域泛化实验进行验证。结果:在内部验证中,平均Dice相似系数达到90.51%,优于所有基线方法,其中nnU-Net为82.91%,DUNet为89.17%,TransUNet为87.91%。在所有外部验证实验中,平均结果比最佳基线高约3%,且在领域泛化中也具有竞争力。结论:本研究探讨了RETFound通过学习潜在通用表示在眼底相机图像中进行OD和OC分割的潜力。我们的FunduSegmenter在整体上优于现有最先进基线方法。所提出的模块是通用的,可以扩展到其他基础模型的微调。临床相关性:该模型在分布内和分布外数据上均表现出强大的稳定性与泛化能力,提供了稳定的OD和OC分割。这是许多自动化任务的关键步骤,从设置准确的视网膜坐标到生物标志物发现。代码和训练权重可在:https://github.com/JusticeZzy/FunduSegmenter上获得。

英文摘要

Purpose: This study introduces the first adaptation of RETFound for joint optic disc (OD) and optic cup (OC) segmentation. RETFound is a well-known foundation model developed for fundus camera and optical coherence tomography images, which has shown promising performance in disease diagnosis. Methods: We propose FunduSegmenter, a model integrating a series of novel modules with RETFound, including a Pre-adapter, a Decoder, a Post-adapter, skip connections with a Convolutional Block Attention Module, and a Vision Transformer block adapter. The model is evaluated on a proprietary dataset, GoDARTS, and four public datasets, IDRiD, Drishti-GS, RIM-ONE-r3, and REFUGE, through internal verification, external verification, and domain generalization experiments. Results: An average Dice similarity coefficient of 90.51% was achieved in internal verification, which outperformed all baselines, some substantially (nnU-Net: 82.91%; DUNet: 89.17%; TransUNet: 87.91%). In all external verification experiments, the average results were about 3% higher than those of the best baseline, and our model was also competitive in domain generalization. Conclusions: This study explored the potential of the latent general representations learned by RETFound for OD and OC segmentation in fundus camera images. Our FunduSegmenter generally outperformed state-of-the-art baseline methods. The proposed modules are general and can be extended to fine-tuning other foundation models. Translational Relevance: The model shows strong stability and generalization on both in-distribution and out-of-distribution data, providing stable OD and OC segmentation. This is an essential step for many automated tasks, from setting accurate retinal coordinates to biomarker discovery. The code and trained weights are available at: https://github.com/JusticeZzy/FunduSegmenter.
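For readers unfamiliar with the metric reported above, the Dice similarity coefficient between a predicted and a reference mask is twice the overlap divided by the sum of the two mask sizes. The sketch below is a generic NumPy version on toy masks, not the evaluation code from the FunduSegmenter repository. Scoring two empty masks as 1.0 is a convention assumed here.

```python
import numpy as np

def dice(pred, target):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * np.logical_and(pred, target).sum() / denom

# Toy 2x3 masks: 3 predicted pixels, 3 reference pixels, 2 shared.
pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice(pred, target)
```

For joint OD and OC segmentation the score would typically be computed per structure and then averaged.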

2508.09001 2026-05-21 cs.CL cs.AI cs.LG

Retrospective Sparse Attention for Efficient Long-Context Generation

回顾性稀疏注意力用于高效长上下文生成

Seonghwan Choi, Beomseok Kang, Dongwon Jo, Jae-Joon Kim

发表机构 * Seoul National University(首尔国立大学)

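AI summary: This paper proposes RetroAttention, a new KV cache update technique that revises past attention outputs using KV entries from subsequent decoding steps, improving the accuracy of long-context generation with minimal latency overhead.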

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于长上下文任务,如推理、代码生成和多轮对话。然而,扩展上下文的推理受到键值(KV)缓存的限制,其内存占用与序列长度成线性增长,且在每个解码步骤中主导延迟。尽管最近的KV缓存压缩方法识别并加载重要的少量token,但它们主要集中在输入上下文中,未能解决长时间解码中累积的注意力误差。在本文中,我们引入了RetroAttention,一种新的KV缓存更新技术,通过回顾后续解码步骤的KV条目来修正过去的注意力输出。通过维护一个轻量级的输出缓存,RetroAttention使过去的查询能够高效地补充更多上下文,同时产生最小的延迟开销。这打破了固定注意力输出的范式,允许对先前近似进行持续修正。在长生成基准测试中,RetroAttention在长生成任务中始终优于最先进的(SOTA)KV压缩方法,有效KV暴露量增加高达1.6倍,准确性提高高达21.9%。

英文摘要

Large Language Models (LLMs) are increasingly deployed in long-context tasks such as reasoning, code generation, and multi-turn dialogue. However, inference over extended contexts is bottlenecked by the Key-Value (KV) cache, whose memory footprint grows linearly with sequence length and dominates latency at each decoding step. While recent KV cache compression methods identify and load a few important tokens, they focus predominantly on input contexts and fail to address the cumulative attention errors that arise during long decoding. In this paper, we introduce RetroAttention, a novel KV cache update technique that retrospectively revises past attention outputs using newly arrived KV entries from subsequent decoding steps. By maintaining a lightweight output cache, RetroAttention enables past queries to be efficiently supplemented with more context, while incurring minimal latency overhead. This breaks the fixed-attention-output paradigm and allows continual correction of prior approximations. Extensive experiments on long-generation benchmarks show that RetroAttention consistently outperforms state-of-the-art (SOTA) KV compression methods, increasing effective KV exposure by up to 1.6× and accuracy by up to 21.9%.
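The abstract does not give RetroAttention's update rule. One well-known way to revise a cached attention output when further KV entries become available is to store the output together with its softmax log-sum-exp and merge partial results, as sketched below. This is a generic illustration of that merge under assumed toy shapes, and whether the paper uses this exact mechanism is an assumption.

```python
import numpy as np

def partial_attention(q, K, V):
    """Softmax attention of one query over a KV block, plus its log-sum-exp."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    w = np.exp(s - m)                      # numerically stable weights
    return (w @ V) / w.sum(), m + np.log(w.sum())

def merge(o_a, lse_a, o_b, lse_b):
    """Combine two partial outputs as if attention had seen both blocks at once."""
    lse = np.logaddexp(lse_a, lse_b)
    return np.exp(lse_a - lse) * o_a + np.exp(lse_b - lse) * o_b, lse

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K_old, V_old = rng.normal(size=(5, 8)), rng.normal(size=(5, 4))
K_new, V_new = rng.normal(size=(3, 8)), rng.normal(size=(3, 4))

o_cached, lse_cached = partial_attention(q, K_old, V_old)  # kept in an output cache
o_new, lse_new = partial_attention(q, K_new, V_new)        # entries that arrive later
o_revised, _ = merge(o_cached, lse_cached, o_new, lse_new)

# Reference: attention over the union of both blocks.
o_full, _ = partial_attention(q, np.vstack([K_old, K_new]), np.vstack([V_old, V_new]))
```

The merged output matches attention over all eight entries, so the cached result can be corrected without recomputing the old block.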

2508.08636 2026-05-21 cs.CL

InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

InternBootcamp 技术报告:通过可验证的任务扩展提升大语言模型推理能力

Peiji Li, Jiasheng Ye, Yongkang Chen, Yichuan Ma, Zijie Yu, Kedi Chen, Xiaozhe Li, Ganqu Cui, Haozhan Li, Jiacheng Chen, Chengqi Lyu, Wenwei Zhang, Linyang Li, Qipeng Guo, Dahua Lin, Bowen Zhou, Kai Chen

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Fudan University(复旦大学)

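AI summary: This paper presents InternBootcamp, a framework that improves LLM reasoning through verifiable task scaling and serves as infrastructure for RL-based model optimization, synthetic data generation, and model evaluation.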

Comments InternBootcamp Tech Report

详情
AI中文摘要

大语言模型(LLMs)通过使复杂推理能力成为可能,彻底改变了人工智能领域。尽管最近在强化学习(RL)方面的进展主要集中在领域特定的推理任务(如数学或代码生成),但现实世界中的推理场景往往需要模型处理多样且复杂的环境,而窄领域基准无法完全捕捉这些需求。为解决这一差距,我们提出了InternBootcamp,一个包含1000多个领域多样任务环境的开源框架,专门用于LLM推理研究。我们的代码库提供了两个关键功能:(1)自动生成无限数量的训练/测试案例,具有可配置的难度级别;(2)集成了用于客观响应评估的验证模块。这些功能使InternBootcamp成为基于强化学习的模型优化、合成数据生成和模型评估的基础设施。尽管手动开发具有庞大任务覆盖的框架非常繁琐,但我们通过自动化代理工作流辅以人工验证协议来加速开发过程,使任务范围能够迅速扩展。通过这些训练营,我们进一步建立了Bootcamp-EVAL,一个自动生成的基准,用于全面性能评估。评估显示,前沿模型在许多推理任务上仍表现不足,而使用InternBootcamp进行训练提供了一种有效的方法,显著提高性能,从而得到我们的32B模型,在Bootcamp-EVAL上取得最先进的结果,并在其他已建立的基准上表现出色。特别是,我们验证了在任务扩展方面的一致性能提升来自于包含更多训练任务,即任务扩展,幅度达两个数量级,为能够推理的通用模型提供了有前途的途径。

英文摘要

Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities. While recent advancements in reinforcement learning (RL) have primarily focused on domain-specific reasoning tasks (e.g., mathematics or code generation), real-world reasoning scenarios often require models to handle diverse and complex environments that narrow-domain benchmarks cannot fully capture. To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments specifically designed for LLM reasoning research. Our codebase offers two key functionalities: (1) automated generation of unlimited training/testing cases with configurable difficulty levels, and (2) integrated verification modules for objective response evaluation. These features make InternBootcamp fundamental infrastructure for RL-based model optimization, synthetic data generation, and model evaluation. Although manually developing such a framework with enormous task coverage is extremely cumbersome, we accelerate the development procedure through an automated agent workflow supplemented by manual validation protocols, which enables the task scope to expand rapidly. With these bootcamps, we further establish Bootcamp-EVAL, an automatically generated benchmark for comprehensive performance assessment. Evaluation reveals that frontier models still underperform in many reasoning tasks, while training with InternBootcamp provides an effective way to significantly improve performance, leading to our 32B model that achieves state-of-the-art results on Bootcamp-EVAL and excels on other established benchmarks. In particular, we validate that consistent performance gains come from including more training tasks, namely task scaling, over two orders of magnitude, offering a promising route towards capable reasoning generalists.
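The two functionalities named above, case generation with configurable difficulty and a verification module, can be pictured with a toy environment. The class `SumBootcamp` and its `generate` and `verify` methods are hypothetical names chosen for this sketch and are not taken from the InternBootcamp codebase.

```python
import random
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    answer: int

class SumBootcamp:
    """Hypothetical task environment: add `difficulty + 1` random integers."""

    def __init__(self, difficulty=1, seed=0):
        self.difficulty = difficulty
        self.rng = random.Random(seed)

    def generate(self):
        """Produce a fresh case; higher difficulty means more terms."""
        terms = [self.rng.randint(1, 99) for _ in range(self.difficulty + 1)]
        return Case(prompt=" + ".join(map(str, terms)) + " = ?", answer=sum(terms))

    def verify(self, case, response):
        """Rule-based check: the last whitespace-separated token must equal the answer."""
        tokens = response.strip().split()
        try:
            return int(tokens[-1]) == case.answer
        except (IndexError, ValueError):
            return False

env = SumBootcamp(difficulty=3, seed=42)
case = env.generate()
# A verifiable reward of this kind could drive RL training or score an evaluation run.
reward = 1.0 if env.verify(case, f"The result is {case.answer}") else 0.0
```

Because cases are sampled rather than stored, the same environment can supply both training data and held-out test cases.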

2508.06206 2026-05-21 cs.RO cs.CV

Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

Affordance-R1:面向多模态大语言模型中可泛化可供性(affordance)推理的强化学习

Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, Yuexin Ma

发表机构 * The Hong Kong University of Science and Technology (GZ)(香港科技大学(广州)) National University of Singapore(新加坡国立大学) ShanghaiTech University(上海科技大学) East China Normal University(华东师范大学) Nanjing University of Information Science & Technology(南京信息工程大学) Zhejiang University(浙江大学) Institute of Automation, Chinese Academy of Science(中国科学院自动化研究所) Shanghai AI Laboratory(上海人工智能实验室)

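AI summary: This paper proposes Affordance-R1, a unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO), achieving zero-shot generalization and test-time reasoning through reinforcement learning.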

详情
AI中文摘要

Affordance grounding 旨在预测与机器人待执行动作相关的物体特定区域。它在人机交互、人-物交互、具身操作和具身感知领域中起着至关重要的作用。现有模型由于缺乏思维链(CoT)推理能力,往往忽视不同物体间共享的 affordance,限制了其域外(OOD)泛化和显式推理能力。为了解决这些挑战,我们提出了 Affordance-R1,这是首个在强化学习范式中集成认知 CoT 引导的 Group Relative Policy Optimization(GRPO)的统一 affordance 定位框架。具体而言,我们设计了一个精细的 affordance 函数,包含格式、感知和认知奖励,以有效引导优化方向。此外,我们构建了一个高质量的以 affordance 为中心的推理数据集 ReasonAff,以支持训练。Affordance-R1 仅通过基于 GRPO 的强化学习进行训练,不使用显式推理数据,实现了稳健的零样本泛化,并表现出涌现的测试时推理能力。全面的实验表明,我们的模型优于已有的成熟方法,并展示了开放世界泛化能力。据我们所知,Affordance-R1 是首个将基于 GRPO 的 RL 与推理结合到 affordance 推理中的方法。我们方法的代码和数据集已发布在 https://github.com/hq-King/Affordance-R1。

英文摘要

Affordance grounding focuses on predicting the specific regions of objects that are associated with the actions to be performed by robots. It plays a vital role in the fields of human-robot interaction, human-object interaction, embodied manipulation, and embodied perception. Existing models often neglect the affordance shared among different objects because they lack Chain-of-Thought (CoT) reasoning abilities, limiting their out-of-domain (OOD) generalization and explicit reasoning capabilities. To address these challenges, we propose Affordance-R1, the first unified affordance grounding framework that integrates cognitive CoT-guided Group Relative Policy Optimization (GRPO) within a reinforcement learning paradigm. Specifically, we designed a sophisticated affordance function, which contains format, perception, and cognition rewards to effectively guide optimization directions. Furthermore, we constructed a high-quality affordance-centric reasoning dataset, ReasonAff, to support training. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Affordance-R1 achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Comprehensive experiments demonstrate that our model outperforms well-established methods and exhibits open-world generalization. To the best of our knowledge, Affordance-R1 is the first to integrate GRPO-based RL with reasoning into affordance reasoning. The code and dataset are released at https://github.com/hq-King/Affordance-R1.
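GRPO scores each sampled response relative to the other responses drawn for the same prompt, instead of using a learned value function. The sketch below shows that group-relative normalization on toy rewards. Summing the format, perception, and cognition terms with equal weight is an assumption made for illustration, and the numbers are invented.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled responses; each total reward is an (assumed) unweighted sum
# of a format term, a perception term, and a cognition term.
format_r = np.array([1.0, 1.0, 0.0, 1.0])
perception_r = np.array([0.8, 0.2, 0.5, 0.6])
cognition_r = np.array([1.0, 0.0, 0.0, 1.0])
advantages = group_relative_advantages(format_r + perception_r + cognition_r)
```

Responses above the group mean receive positive advantages and are reinforced, while those below it are pushed down.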

2508.04999 2026-05-21 cs.LG

Disentangling Bias by Modeling Intra- and Inter-modal Causal Attention for Multimodal Sentiment Analysis

通过建模模态内与跨模态因果注意力解耦偏差的多模态情感分析

Menghua Jiang, Yuxia Lin, Baoliang Chen, Haifeng Hu, Yuncheng Jiang, Sijie Mai

发表机构 * School of Computer Science, South China Normal University(华南师范大学计算机学院) School of Electronics and Information Technology, Sun Yat-sen University(中山大学电子与信息学院)

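AI summary: This paper proposes a Multi-relational Multimodal Causal Intervention (MMCI) framework that uses backdoor adjustment from causal theory to address bias caused by statistical shortcuts in multimodal sentiment analysis. It models multimodal inputs as a multi-relational graph and applies an attention mechanism to separate causal features from shortcut features, making predictions more stable under distribution shifts.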

Comments Corrected several hyperparameter settings. Updated some experimental results

详情
AI中文摘要

多模态情感分析(MSA)旨在通过整合文本、音频和视觉等多种模态的信息来理解人类情感。然而,现有方法常面临模态内部和跨模态的虚假相关性问题,导致模型依赖统计捷径而非真实因果关系,从而影响泛化能力。为缓解此问题,我们提出了一种多关系多模态因果干预(MMCI)框架,该框架利用因果理论中的后门调整方法来处理此类捷径的干扰影响。具体而言,我们首先将多模态输入建模为多关系图,以显式捕捉内模和跨模依赖关系。然后,我们应用注意力机制,分别估计并分离与这些内模和跨模关系对应的因果特征和捷径特征。最后,通过应用后门调整,我们对捷径特征进行分层并动态将其与因果特征结合,以促使MMCI在分布偏移下产生稳定的预测。在多个标准MSA数据集和分布外(OOD)测试集上的大量实验表明,我们的方法有效抑制了偏见并提升了性能。

英文摘要

Multimodal sentiment analysis (MSA) aims to understand human emotions by integrating information from multiple modalities, such as text, audio, and visual data. However, existing methods often suffer from spurious correlations both within and across modalities, leading models to rely on statistical shortcuts rather than true causal relationships, thereby undermining generalization. To mitigate this issue, we propose a Multi-relational Multimodal Causal Intervention (MMCI) framework, which leverages the backdoor adjustment from causal theory to address the confounding effects of such shortcuts. Specifically, we first model the multimodal inputs as a multi-relational graph to explicitly capture intra- and inter-modal dependencies. Then, we apply an attention mechanism to separately estimate and disentangle the causal features and shortcut features corresponding to these intra- and inter-modal relations. Finally, by applying the backdoor adjustment, we stratify the shortcut features and dynamically combine them with the causal features to encourage MMCI to produce stable predictions under distribution shifts. Extensive experiments on several standard MSA datasets and out-of-distribution (OOD) test sets demonstrate that our method effectively suppresses biases and improves performance.
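Backdoor adjustment replaces the observational quantity P(y | x) with P(y | do(x)) = Σ_z P(y | x, z) P(z), averaging over strata z of the confounder. The sketch below applies that formula to toy discrete probabilities. MMCI itself works on learned causal and shortcut features, so this shows only the underlying identity, with invented numbers.

```python
import numpy as np

def backdoor_adjust(p_y_given_xz, p_z):
    """P(y | do(x)) = sum_z P(y | x, z) P(z); rows index x, columns index z."""
    return np.asarray(p_y_given_xz) @ np.asarray(p_z)

p_z = np.array([0.7, 0.3])             # P(z): prevalence of each shortcut stratum
p_y_given_xz = np.array([[0.9, 0.4],   # P(y=1 | x=0, z=0), P(y=1 | x=0, z=1)
                         [0.6, 0.2]])  # P(y=1 | x=1, z=0), P(y=1 | x=1, z=1)
p_do = backdoor_adjust(p_y_given_xz, p_z)
```

Weighting every stratum by P(z) rather than by P(z | x) is what removes the dependence of the prediction on how often a shortcut co-occurs with the input.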

2508.03578 2026-05-21 cs.CV

RadProPoser: Probabilistic Radar Tensor Human Pose Estimation That Knows Its Limits

RadProPoser:知晓自身局限的概率雷达张量人体姿态估计

Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier

发表机构 * Munich Center for Machine Learning, Germany(慕尼黑机器学习中心,德国)

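AI summary: This paper presents RadProPoser, an end-to-end probabilistic framework that predicts 3D body joints and per-joint uncertainties from raw radar tensor data. It achieves a mean per-joint position error of 6.425 cm on a new benchmark dataset and calibrates total uncertainty through isotonic recalibration.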

Comments Accepted at IJCNN 2026 (WCCI, Maastricht)

详情
AI中文摘要

基于雷达的人体姿态估计使环境智能中的隐私保护运动跟踪成为可能,但雷达传感的噪声特性使得不确定性量化至关重要。我们提出了RadProPoser,一种端到端的概率框架,能够从原始雷达张量数据中预测三维身体关节及每个关节的不确定性。该方法使用带频谱注意力的变分编码器-解码器,在多个时间帧上融合雷达的实部和虚部分量,并通过可学习的高斯分布和拉普拉斯分布对偶然(aleatoric)不确定性进行建模。在以光学动作捕捉为真值的新基准数据集上训练后,我们的方法实现了6.425厘米的平均每关节位置误差(MPJPE)。模型输出每个关节的偶然不确定性,经保序回归再校准(isotonic recalibration)后得到校准的总不确定性,预期校准误差为0.027。由于频谱注意力在单个雷达张量分量上操作,扩展到多雷达配置只需拼接额外的输入流。在采用双正交雷达的HuPR基准上,该方法实现了5.042厘米的MPJPE。该框架在NVIDIA RTX 3090上以每秒89帧(FPS)的速度运行,超过了15赫兹的雷达帧率。

英文摘要

Radar-based human pose estimation enables privacy-preserving motion tracking for ambient intelligence, yet the noisy nature of radar sensing makes uncertainty quantification essential. We present RadProPoser, an end-to-end probabilistic framework that predicts three-dimensional body joints with per-joint uncertainties from raw radar tensor data. Using a variational encoder-decoder with spectral attention that fuses real and imaginary radar components across temporal frames, we model aleatoric uncertainty through learnable Gaussian and Laplace distributions. Trained on a new benchmark dataset with optical motion-capture ground truth, our method achieves 6.425 cm mean per-joint position error. The model outputs per-joint aleatoric uncertainties, and isotonic recalibration yields calibrated total uncertainty with expected calibration error of 0.027. Since spectral attention operates on individual radar tensor components, extending to multi-radar configurations requires only concatenating additional input streams. On the HuPR benchmark with dual orthogonal radars, this achieves 5.042 cm MPJPE. The framework runs at 89 frames per second (FPS) on an NVIDIA RTX 3090, exceeding the 15 Hz radar frame rate.
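Two quantities in the abstract have simple closed forms: the mean per-joint position error (MPJPE) and the Gaussian negative log-likelihood that rewards a model for reporting larger uncertainty where it is wrong. The sketch below is a generic NumPy version on toy data in centimetres, not the authors' training code.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; inputs have shape (frames, joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def gaussian_nll(pred, gt, sigma):
    """Per-coordinate Gaussian negative log-likelihood, averaged over all entries."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2) + (gt - pred) ** 2 / (2 * sigma**2))

# Toy data: every predicted joint is off by (3, 4, 0) cm from the ground truth.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[..., 0] = 3.0
pred[..., 1] = 4.0
error = mpjpe(pred, gt)

confident = gaussian_nll(pred, gt, np.ones_like(gt))       # sigma = 1 cm everywhere
cautious = gaussian_nll(pred, gt, 5.0 * np.ones_like(gt))  # sigma = 5 cm everywhere
```

On these inaccurate predictions the cautious model obtains the lower loss, which is the mechanism that lets a network learn per-joint aleatoric uncertainty.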

2508.02291 2026-05-21 cs.LG cs.AI

FAIR-Pruner: A Flexible Framework for Automatic Layer-Wise Pruning via Tolerance of Difference

FAIR-Pruner: 一种通过差异容忍性实现自动分层剪枝的灵活框架

Chenqing Lin, Mostafa Hussien, Chengyao Yu, Bingyi Jing, Ruixing Ming, Kim Khoa Nguyen, Mohamed Cheriet

发表机构 * School of Statistics and Mathematics, Zhejiang Gongshang University(浙江工商大学统计与数学学院) École de technologie supérieure (ÉTS), Université du Québec(魁北克大学高等技术学院) Southern University of Science and Technology(南方科技大学)

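AI summary: This paper proposes FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning that uses Tolerance of Difference (ToD) to induce non-uniform pruning depths across layers, achieving strong accuracy-compression trade-offs across multiple datasets and models.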

Comments Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence

详情
AI中文摘要

结构化剪枝是压缩深度神经网络的标准工具,但其实际性能取决于稀疏性如何分配到各层。我们提出了FAIR-Pruner,一种无需搜索的自适应分层结构化剪枝框架。FAIR-Pruner使用两种在同一层内的排名:一种是去除导向的信号,提出候选单元;另一种是保护导向的信号,识别任务敏感的单元。其核心组件,差异容忍度(ToD),测量去除前缀与保护尾部之间的重叠,并使用共享容忍级别来诱导各层非均匀的剪枝深度。作为默认视觉实例,FAIR-Pruner结合基于Wasserstein的U-Score用于类条件单元分离性,以及基于Taylor的R-Score用于任务级敏感性;相同的ToD分配规则也可以与替代的去除信号配对。理论上,我们通过群体R-Score分析ToD,推导出高R-Score质量进入剪枝集的排名控制,并识别出相同预算比较与均匀剪枝的加法交换条件。在CIFAR-10、CIFAR-100、SVHN和ImageNet上,跨VGG、ResNet、DenseNet、ConvNeXt和DeiT的实验显示了强的准确率-压缩率权衡。在 routed-expert Qwen1.5-MoE-A2.7B-Chat 上的仅剪枝实验进一步检验了在匹配专家预算下的架构扩展性。FAIR-Pruner作为可 pip-install 的开源包发布。

英文摘要

Structured pruning is a standard tool for compressing deep neural networks, but its practical performance depends on how sparsity is allocated across layers. We propose FAIR-Pruner, a search-free framework for adaptive layer-wise structured pruning. FAIR-Pruner uses two within-layer rankings: a removal-oriented signal that proposes candidate units and a protection-oriented signal that identifies task-sensitive units. Its core component, Tolerance of Difference (ToD), measures the overlap between the removal prefix and the protected tail, and uses a shared tolerance level to induce non-uniform pruning depths across layers. As a default vision instantiation, FAIR-Pruner combines a Wasserstein-based U-Score for class-conditional unit separability with a Taylor-based R-Score for task-level sensitivity; the same ToD allocation rule can also be paired with alternative removal signals. Theoretically, we analyze ToD through the population R-Score, derive rank-based control of the high-R-Score mass entering the pruning set, and identify an additive exchange condition for same-budget comparison with uniform pruning. Experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet across VGG, ResNet, DenseNet, ConvNeXt, and DeiT show strong accuracy-compression trade-offs. Prune-only experiments on routed-expert Qwen1.5-MoE-A2.7B-Chat further examine architectural extensibility under matched expert budgets. FAIR-Pruner is released as a pip-installable open-source package.
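The abstract describes ToD only as the overlap between a removal prefix and a protected tail, so the exact definition is not available here. The sketch below assumes one plausible reading: at depth k, the prefix is the k units with the lowest U-Score, the tail is the k units with the highest R-Score, and a layer is pruned up to the last depth before their overlap exceeds a shared tolerance `alpha`. All function names and scores are illustrative and do not come from the FAIR-Pruner package.

```python
import numpy as np

def tod(u_score, r_score, k):
    """Assumed ToD: share of the k removal candidates that are also among the k most protected units."""
    prefix = set(np.argsort(u_score)[:k])        # lowest U-Score first
    tail = set(np.argsort(r_score)[::-1][:k])    # highest R-Score first
    return len(prefix & tail) / k

def pruning_depth(u_score, r_score, alpha):
    """Deepest k reached before the assumed ToD first exceeds the tolerance alpha."""
    depth = 0
    for k in range(1, len(u_score) + 1):
        if tod(u_score, r_score, k) > alpha:
            break
        depth = k
    return depth

# Two toy layers with the same U-Scores but different R-Scores.
u = np.array([0.1, 0.2, 0.3, 0.4, 0.9, 0.8])
layer_a = (u, np.array([0.1, 0.2, 0.3, 0.4, 0.9, 0.8]))  # both signals agree
layer_b = (u, np.array([0.9, 0.8, 0.1, 0.2, 0.3, 0.4]))  # removal candidates are task-sensitive
depths = [pruning_depth(us, rs, alpha=0.0) for us, rs in (layer_a, layer_b)]
```

With one shared tolerance, the layer whose rankings agree is pruned three units deep while the conflicting layer is left intact, which illustrates how a single `alpha` can induce non-uniform depths.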

2507.23313 2026-05-21 cs.CV

The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

伦勃朗的牛 - 分析文本到图像模型中艺术提示的解释

Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti

发表机构 * Department of Computer Science, Università degli Studi di Milano, Via Celoria, 18, 20133 Milan, Italy(米兰大学计算机科学系)

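AI summary: This paper studies how text-to-image diffusion models interpret content and style when generating artworks. Cross-attention heatmaps attribute pixels in generated images to specific prompt tokens, revealing varying degrees of content-style separation across artistic prompts and styles, and offering insight into how large-scale generative models internally represent complex artistic concepts.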

Comments to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025

详情
AI中文摘要

文本到图像扩散模型通过学习数十亿张图像,在生成艺术内容方面展现了显著的能力,包括流行艺术作品。然而,这些模型如何内部表示概念,如绘画中的内容和风格,这一基本问题仍未被探索。传统计算机视觉假设内容和风格是正交的,但扩散模型在训练过程中并未获得关于这一区别的显式指导。在本文中,我们研究了基于Transformer的文本到图像扩散模型在生成艺术作品时如何编码内容和风格概念。我们利用交叉注意力热图将生成图像中的像素归因于特定的提示词,使我们能够隔离受内容描述词和风格描述词影响的图像区域。我们的发现表明,扩散模型在不同艺术提示和风格请求下表现出不同程度的内容-风格分离。在许多情况下,内容词主要影响物体相关区域,而风格词影响背景和纹理区域,这表明模型对内容-风格区别的理解是涌现的。这些见解有助于理解大规模生成模型如何在没有显式监督的情况下内部表示复杂的艺术概念。我们分享了代码、数据集以及用于可视化注意力地图的探索工具,地址为https://github.com/umilISLab/artistic-prompt-interpretation。

英文摘要

Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

2507.09180 2026-05-21 cs.CV cs.RO

Multimodal Fusion for Sim2real Transfer in Visual Reinforcement Learning

多模态融合用于视觉强化学习中的仿真到现实迁移

Zichun Xu, Jingdong Zhao, Chenyu Guo, Qianxue Zhang, Liao Zhang, Xiao Zhang, Yiming Ren, Lian Zhang, Zengren Zhao

发表机构 * Medical Artificial Intelligence Lab, The First Hospital of Hebei Medical University, Hebei Medical University(医学人工智能实验室,河北医科大学第一医院,河北医科大学) State Key Laboratory of Robotics and Systems, Harbin Institute of Technology(机器人系统国家重点实验室,哈尔滨工业大学)

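AI summary: This paper proposes a vision transformer-based multimodal fusion backbone that fuses RGB and depth to improve generalization, with a contrastive learning scheme and a curriculum-based domain randomization scheme to improve sample efficiency and stabilize training. The model is validated on real-world manipulation tasks via zero-shot transfer.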

详情
AI中文摘要

深度信息对场景外观变化具有鲁棒性,并固有地包含3D空间细节。因此,本文提出基于视觉变换器的视觉主干,用于融合RGB和深度模态以增强泛化能力。不同模态首先通过单独的CNN茎部进行处理,结合的卷积特征被送入可扩展的视觉变换器以获得视觉表示。此外,设计了一种对比学习方案,通过掩码和未掩码的token来提高样本效率和泛化性能。采用基于课程的域随机化方案以灵活稳定训练过程。最后,仿真结果表明,我们的融合方案优于其他基线。通过零样本迁移验证了模型的可行性,能够执行现实世界操作任务。

英文摘要

Depth information is robust to scene appearance variations and inherently carries 3D spatial details. This paper therefore proposes a visual backbone based on the vision transformer that fuses RGB and depth modalities to enhance generalization. Different modalities are first processed by separate CNN stems, and the combined convolutional features are delivered to the scalable vision transformer to obtain visual representations. Moreover, a contrastive learning scheme is designed with masked and unmasked tokens to enhance sample efficiency and generalization performance. A curriculum-based domain randomization scheme is used to flexibly stabilize the training process. Finally, simulation results demonstrate that our fusion scheme outperforms the other baselines. The feasibility of our model is validated on real-world manipulation tasks via zero-shot transfer.

2506.21039 2026-05-21 cs.LG cs.AI

Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning

严格子目标执行:分层强化学习中可靠的长时程规划

Jaebak Hwang, Sanghyeon Lee, Jeongmo Kim, Seungyul Han

发表机构 * Graduate School of Artificial Intelligence(人工智能研究生院) Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院) Ulsan, South Korea(韩国蔚山)

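AI summary: This paper proposes Strict Subgoal Execution (SSE), a framework that uses Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making, enabling more reliable planning in long-horizon tasks.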

Comments 10 pages for main, 26 pages for total, Accepted to ICLR 2026

Journal ref International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

长horizon目标条件任务对强化学习(RL)提出了根本性挑战,特别是在目标遥远且奖励稀疏的情况下。虽然分层和图基方法提供了部分解决方案,但它们对传统 hindsight relabeling 的依赖往往无法纠正子目标不可行性,导致高层规划效率低下。为此,我们提出严格子目标执行(SSE),一种基于图的分层RL框架,整合前沿经验回放(FER)以分离不可达与可接受的子目标,并优化高层决策。FER利用失败和部分成功转移确定可达性前沿,识别不可靠的子目标,提高子目标可靠性,并减少不必要的高层决策。此外,SSE采用解耦探索策略以覆盖目标空间的未探索区域,并通过路径细化调整边成本以利用观察到的低层失败。在多样化的长horizon基准测试中,SSE在效率和成功率方面均优于现有目标条件和分层RL方法。我们的代码可在 https://jaebak1996.github.io/SSE/ 上获得。

英文摘要

Long-horizon goal-conditioned tasks pose fundamental challenges for reinforcement learning (RL), particularly when goals are distant and rewards are sparse. While hierarchical and graph-based methods offer partial solutions, their reliance on conventional hindsight relabeling often fails to correct subgoal infeasibility, leading to inefficient high-level planning. To address this, we propose Strict Subgoal Execution (SSE), a graph-based hierarchical RL framework that integrates Frontier Experience Replay (FER) to separate unreachable from admissible subgoals and streamline high-level decision making. FER delineates the reachability frontier using failure and partial-success transitions, which identifies unreliable subgoals, increases subgoal reliability, and reduces unnecessary high-level decisions. Additionally, SSE employs a decoupled exploration policy to cover underexplored regions of the goal space and a path refinement that adjusts edge costs using observed low-level failures. Experimental results across diverse long-horizon benchmarks show that SSE consistently outperforms existing goal-conditioned and hierarchical RL methods in both efficiency and success rate. Our code is available at https://jaebak1996.github.io/SSE/

2506.17631 2026-05-21 cs.LG cs.AI

Time-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting

Time-Prompt: 集成异构提示以解锁时间序列预测中的LLM

Zesen Wang, Lijuan Lan, Yonggang Li

发表机构 * Central South University, Changsha, China(中南大学,长沙,中国)

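AI summary: This paper proposes Time-Prompt, a framework that improves time series forecasting by constructing a unified prompt paradigm, designing a semantic space embedding and cross-modal alignment module, and efficiently fine-tuning LLM parameters. It is evaluated on 6 public datasets and 3 carbon emission datasets.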

Comments Accepted at IJCNN 2026

详情
AI中文摘要

时间序列预测旨在建模变量间的时序依赖关系以推断未来状态,对现实世界场景具有重要性和广泛应用。尽管基于深度学习的方法已取得显著进展,但其在长期预测中仍表现不佳。最近研究表明,大型语言模型(LLMs)在时间序列预测中表现出色,但其在该任务中的实用性仍存疑。为此,我们提出Time-Prompt框架,旨在激活LLMs进行时间序列预测。具体而言,我们首先构建了一个统一的提示范式,利用可学习的软提示引导LLM的行为,并利用文本化的硬提示增强时间序列表示。其次,为了增强LLM对预测任务的全面理解,我们设计了一个语义空间嵌入和跨模态对齐模块,以实现时序和文本数据的融合。最后,我们利用时间序列数据高效地微调LLM的参数。此外,我们专注于碳排放领域,旨在为全球碳中和做出贡献。在6个公开数据集和3个碳排放数据集上的综合评估表明,Time-Prompt是一个强大的时间序列预测框架。

英文摘要

Time series forecasting aims to model temporal dependencies among variables for future state inference, holding significant importance and widespread applications in real-world scenarios. Although deep learning-based methods have achieved remarkable progress, they still exhibit suboptimal performance in long-term forecasting. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting, but this progress is still met with skepticism about whether LLMs are truly useful for this task. To address this, we propose Time-Prompt, a framework for activating LLMs for time series forecasting. Specifically, we first construct a unified prompt paradigm with learnable soft prompts to guide the LLM's behavior and textualized hard prompts to enhance the time series representations. Second, to enhance the LLM's comprehensive understanding of the forecasting task, we design a semantic space embedding and cross-modal alignment module to achieve fusion of temporal and textual data. Finally, we efficiently fine-tune the LLM's parameters using time series data. Furthermore, we focus on carbon emissions, aiming to provide a modest contribution to global carbon neutrality. Comprehensive evaluations on 6 public datasets and 3 carbon emission datasets demonstrate that Time-Prompt is a powerful framework for time series forecasting.

2505.19075 2026-05-21 cs.AI cs.CL cs.LG

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

Universal Reasoner: A Single, Composable, Plug-and-Play Reasoner for Frozen LLMs

Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye

Affiliations * Graduate School of Artificial Intelligence, Korea Advanced Institute of Science and Technology

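AI summary This paper proposes Universal Reasoner, a composable, plug-and-play reasoning module that provides specialized reasoning capabilities on top of frozen large language models and achieves weak-to-strong generalization through a shared or aligned token space; experiments show that it outperforms existing fine-tuning methods on mathematical reasoning and machine translation.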

Comments ICML 2026

Details
English abstract

Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR), a modular, composable, and plug-and-play reasoning module that can be used with larger frozen LLMs to provide specialized reasoning capabilities with a shared or aligned token space. Specifically, UniR decomposes the reward into a standalone reasoning module trained in a decoupled manner using verifiable rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR is combined with frozen LLMs at inference time by simply adding its output logits to those of the backbone. This additive structure enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Furthermore, UniR demonstrates weak-to-strong generalization, where reasoning modules trained on smaller models effectively guide much larger LLMs in the same model family, and generalizes across domains such as vision-language models and medical reasoning. Experiments on mathematical reasoning and machine translation show that UniR surpasses existing fine-tuning methods. Code is open-sourced at https://github.com/hangeol/UniR.
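
The additive combination described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`combine_logits`, `greedy_next_token`), the toy 4-token vocabulary, and all numeric values are assumptions made for illustration, and the abstract does not say whether the logits are weighted or which decoding strategy is used. The sketch only shows that summing per-token logits from a frozen backbone and one or more modules over a shared vocabulary changes the next-token choice.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(backbone_logits, module_logits_list):
    """Add each module's logits to the frozen backbone's logits.

    Every list must cover the same vocabulary (a shared or aligned
    token space), so the lengths have to match.
    """
    combined = list(backbone_logits)
    for module_logits in module_logits_list:
        if len(module_logits) != len(combined):
            raise ValueError("vocabulary sizes must match")
        combined = [c + m for c, m in zip(combined, module_logits)]
    return combined


def greedy_next_token(backbone_logits, module_logits_list):
    """Pick the most probable token id after combining the logits."""
    probs = softmax(combine_logits(backbone_logits, module_logits_list))
    return max(range(len(probs)), key=probs.__getitem__)


# Toy values, invented for illustration only.
backbone = [2.0, 1.0, 0.5, 0.0]       # frozen LLM alone prefers token 0
module_a = [-1.0, 2.5, 0.0, 0.0]      # hypothetical module favouring token 1
module_b = [0.0, 0.5, 0.0, 0.0]       # second hypothetical module

print(greedy_next_token(backbone, []))                    # backbone only
print(greedy_next_token(backbone, [module_a]))            # one module added
print(greedy_next_token(backbone, [module_a, module_b]))  # two modules summed
```

Because the backbone's logits are only read and never modified, the same module outputs could in principle be added to any backbone that shares the vocabulary, which is the property the abstract relies on for composition.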

2505.14654 2026-05-21 cs.CV cs.AI cs.CL

Beyond Words: Multimodal LLM Knows When to Speak

Beyond Words: When Multimodal Large Language Models Speak

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

Affiliations * Department of Computer Science, Stony Brook University; Atmee AI

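AI summary This paper proposes a multimodal strategy that uses synchronized video, audio, and text cues to improve awareness of response timing in conversation, thereby improving the response accuracy of large language models in dialogue.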

Comments Project page: https://github.com/lzk901372/MM-When2Speak

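Details
AI abstract (translated from Chinese)

Chatbots based on large language models (LLMs) can generate fluent responses but often struggle with when to speak, especially when brief, timely listener reactions are needed during a conversation. We propose a multimodal strategy that uses synchronized video, audio, and text cues to improve timing awareness in conversation. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide under streaming constraints whether to remain silent, produce a short reaction, or start a full response. We therefore introduce a curated multimodal dataset drawn from real-world dyadic conversation videos, with temporally aligned multimodal data and fine-grained reaction-type annotations. In addition, we design a multimodal strategy, MM-When2Speak, which adds a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response-type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.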

English abstract

Chatbots built on large language models (LLMs) generate fluent responses but often struggle with when to speak, especially for brief, timely listener reactions during ongoing dialogue. We present a multimodal strategy for LLMs, which leverages synchronized video, audio, and text cues to improve conversational timing awareness. The strategy reformulates response timing as a dense response-type prediction task, enabling an agent to decide whether to remain silent, produce a short reaction, or start a full response under streaming constraints. Therefore, we introduce a curated multimodal dataset from real-world dyadic conversational videos with temporally aligned modalities and fine-grained reaction type annotations. Moreover, we design a multimodal strategy, MM-When2Speak, with a multimodal integration module on top of an LLM backbone. Experiments across various modality settings and strong LLM baselines show that MM-When2Speak achieves up to a 3x improvement in response type prediction performance, highlighting the importance of multimodal perception for natural and engaging conversational interaction.
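
The dense response-type prediction framing can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the paper's method: the `ResponseType` labels mirror the three options named in the abstract, while the per-step scores, the argmax decision rule, and the names `decide` and `decide_stream` are hypothetical. In the described system the scores would come from a model over video, audio, and text; here they are invented numbers.

```python
from enum import Enum


class ResponseType(Enum):
    """The three options named in the abstract (label names are assumed)."""
    SILENT = "silent"
    SHORT_REACTION = "short_reaction"
    FULL_RESPONSE = "full_response"


def decide(scores):
    """Return the response type with the highest score at one time step."""
    return max(scores, key=scores.get)


def decide_stream(score_stream):
    """Yield one decision per time step, using only that step's scores.

    Processing steps one at a time, without looking ahead, is a simple
    stand-in for the streaming constraint.
    """
    for scores in score_stream:
        yield decide(scores)


# Invented per-step scores for a short stretch of conversation.
stream = [
    {ResponseType.SILENT: 0.8, ResponseType.SHORT_REACTION: 0.1, ResponseType.FULL_RESPONSE: 0.1},
    {ResponseType.SILENT: 0.3, ResponseType.SHORT_REACTION: 0.6, ResponseType.FULL_RESPONSE: 0.1},
    {ResponseType.SILENT: 0.2, ResponseType.SHORT_REACTION: 0.1, ResponseType.FULL_RESPONSE: 0.7},
]

decisions = list(decide_stream(stream))
print([d.value for d in decisions])
```

Making a prediction at every time step, rather than once per turn, is what makes the task "dense": silence is an explicit class rather than the absence of output.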