arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.09708 2026-05-12 cs.LG cs.AI cs.DC

Metal-Sci: A Scientific Compute Benchmark for Evolutionary LLM Kernel Search on Apple Silicon

Víctor Gallego

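Affiliations: Komorebi AI Technologies

AI summary: This paper presents Metal-Sci, a scientific computing benchmark for evaluating evolutionary large language model (LLM) kernel search on Apple Silicon, comprising ten tasks across six optimization regimes. The benchmark is paired with a lightweight harness that automatically compiles candidate kernels, evaluates their performance, and feeds structured diagnostics back to a frozen LLM that drives the evolutionary search. Kernel search with Claude, Gemini, and GPT models on an M1 Pro yields self-speedups of up to 10.7×, and the paper proposes a scoring function based on a held-out configuration to detect performance regressions at sizes unseen during search.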

Comments Preprint

English abstract

We present Metal-Sci, a 10-task benchmark of scientific Apple Silicon Metal compute kernels spanning six optimization regimes (stencils, all-pairs in $n$-body problems, multi-field Boltzmann, neighbor-list molecular dynamics, multi-kernel PDE, FFT). Each task ships a CPU reference, a roofline-anchored fitness function, and a held-out generalization size. We pair the benchmark with a lightweight harness for automatic kernel search that runtime-compiles each candidate, scores it against the roofline across multiple sizes, and feeds structured compile and per-size correctness diagnostics back to a frozen LLM driving a $(1{+}1)$ evolutionary loop. We report matched single-model sweeps of Claude Opus 4.7, Gemini 3.1 Pro, and GPT 5.5 on M1 Pro: in-distribution self-speedups span $1.00\times$ to $10.7\times$. Beyond raw speedup, our central methodological claim is structural: the held-out gate scoring function $\Phi_\mathcal{T}$ (evaluated once at end-of-run on a configuration the agent never sees during search) functions as a cheap mechanical oversight primitive on this automatic search loop, catching e.g. an Opus template <uint D> HMC win that returns wrong samples at unseen dimensions, and a GPT FFT3D best that wins in-distribution at $2.95\times$ speedup but collapses to $0.23\times$ on a $256^3$ held-out cube, a silent regression that the in-distribution score alone cannot see. Code at https://github.com/vicgalle/metal-sci-kernels
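The search loop described in this abstract can be pictured with a small toy. The sketch below is an illustration under loud assumptions, not the authors' harness: the LLM proposer is replaced by a random mutation stub, the "kernel" is just a tile-size integer, and `in_dist_speedup`, `held_out_speedup`, `propose`, and `one_plus_one_search` are hypothetical names with made-up score curves. It shows only the structure: a $(1{+}1)$ loop that keeps a child when it is no worse than its parent, followed by a held-out gate evaluated once at the end of the run.

```python
import random


def in_dist_speedup(tile):
    # Toy in-distribution score (assumption): larger tiles always look better.
    return tile / 8.0


def held_out_speedup(tile):
    # Toy held-out score (assumption): tiles above 32 collapse at the unseen size.
    return tile / 8.0 if tile <= 32 else 0.25


def propose(tile, rng):
    # Stand-in for the frozen LLM: randomly double or halve, clipped to [8, 64].
    child = tile * 2 if rng.random() < 0.5 else tile // 2
    return max(8, min(64, child))


def one_plus_one_search(seed, steps=50, gate=1.0, rng_seed=0):
    rng = random.Random(rng_seed)
    parent, parent_fit = seed, in_dist_speedup(seed)
    for _ in range(steps):
        child = propose(parent, rng)
        child_fit = in_dist_speedup(child)
        # (1+1) selection: the single child replaces the parent if it is no worse.
        if child_fit >= parent_fit:
            parent, parent_fit = child, child_fit
    # The held-out gate is computed once, after the search has finished.
    held = held_out_speedup(parent)
    return parent, parent_fit, held, held >= gate


best, fit, held, passed = one_plus_one_search(seed=8)
print(best, fit, held, passed)
```

In this toy the search climbs to the largest tile because the in-distribution score rewards it, and only the end-of-run gate reveals that the winner regresses at the unseen size. The search itself never reads the held-out score, which is the property the abstract emphasizes.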

2605.09707 2026-05-12 cs.LG cs.AI

Adaptive Data Harvesting for Efficient Neural Network Learning with Universal Constraints

Siteng Kang, Xinhua Zhang

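Affiliations: University of Illinois Chicago

AI summary: This paper studies the training of neural networks that must satisfy universal constraints over continuous domains, such as Lyapunov neural networks and physics-informed neural networks, where analytical solutions are typically unavailable or overly restrictive. The authors propose a reinforcement-learning-based adaptive data harvesting method that learns from data and experience to adjust samples dynamically, improving training efficiency and constraint satisfaction. The method is validated on several tasks, indicating broad applicability to training settings that require adaptive input selection.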

Comments Preprint

English abstract

Training neural networks to satisfy universal constraints over continuous domains poses unique challenges. Common examples include Lyapunov Neural Networks (Lyapunov NNs) and Physics-Informed Neural Networks (PINNs), where analytical solutions are generally either unavailable or overly restrictive. Sample-based methods are therefore commonly used to enforce these constraints, and the choice of samples has a substantial impact on convergence speed, stability, and solution quality. Most existing methods rely on fixed heuristics or handcrafted rules, and are suboptimal in practice. In this paper, we aim to improve upon them by learning, from data and experience, how to dynamically and iteratively adjust the samples in response to the model's evolving learning performance. Trained by reinforcement learning, the learned policy improves empirical constraint satisfaction on test problems while significantly improving efficiency. We validate the approach on both Lyapunov NNs and PINNs, and demonstrate its broader applicability to domains where adaptive input selection is essential for effective training.

2605.09703 2026-05-12 cs.CV

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

Xiaoyu Yuan, Niklas Heikkala, Tiina Törmänen, Hanna Järvenoja, Guoying Zhao, Haoyu Chen

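Affiliations: University of Oulu

AI summary: This paper presents MOTOR-Bench, a real-world dataset and multi-agent framework for zero-shot understanding of human mental states. The dataset contains 1,440 multimodal video clips from collaborative learning scenarios, each labeled by educational experts based on self-regulated learning theory, to support structured analysis of complex interpersonal interactions. To address the limitations of existing methods in reasoning from observable behavior to deeper mental states, the authors propose MOTOR-MAS, a multi-agent framework whose structured coordination mechanism improves prediction of behavior, cognition, and emotion labels; experiments show that it substantially outperforms existing methods on several metrics.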

Comments Accepted by CVPR 2026 workshop AI4RWC

English abstract

Understanding human mental states from natural behavior is crucial for intelligent systems in the real world. However, most current research focuses on predicting isolated mental state labels, lacking structured annotations of complex interpersonal interactions. To support structured analysis, we introduce MOTOR-Bench, a carefully designed benchmark with a real-world dataset, MOTOR-dataset, containing 1,440 multimodal video clips in collaborative learning scenarios, reflecting key real-world data challenges including natural class imbalance, visual noise, and domain-specific language. Each sample is labeled by educational experts based on self-regulated learning theory. We further evaluate several state-of-the-art multimodal large language models and multi-agent systems in a zero-shot setting on our MOTOR-Bench. However, their performance on this task remains limited, suggesting that existing methods still struggle with structured reasoning from observable behavior to deeper mental states. To address this challenge, we propose a reasoning multi-agent framework, named MOTOR-MAS. It coordinates multiple agents through a structured agent coordination mechanism to infer explicit behaviors, internal cognitions, and psychological emotions. Experimental results show that our MOTOR-MAS outperforms the best single-model benchmark by 15.93 points in Macro-F1 scores for the three labels of behavior, cognition, and emotion, and outperforms the general multi-agent benchmark by 10.2 points in internal cognition prediction.
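Because the abstract reports Macro-F1 on a dataset with natural class imbalance, a short sketch of that metric may help. This is the standard definition (the unweighted mean of per-class F1), written in plain Python; the function name `macro_f1` and the toy labels are illustrative assumptions and are unrelated to the MOTOR data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in truth or prediction."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no hits and no errors scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: always predicting the majority class.
truth = ["on_task", "on_task", "on_task", "off_task"]
pred = ["on_task", "on_task", "on_task", "on_task"]
accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)
print(accuracy, macro_f1(truth, pred))
```

The majority-class predictor reaches 75% accuracy here but a much lower Macro-F1, because the minority class contributes an F1 of zero with equal weight. This is why Macro-F1 is a common choice when classes are imbalanced.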

2605.09701 2026-05-12 cs.CV

DriveFuture: Future-Aware Latent World Models for Autonomous Driving

Yufeng Hong, Xiaotian Zhou, Yingyan Li, Xiangpo Zhou, Lin Liu, Yadan Luo, Shaoqing Xu, Lei Yang, Ziying Song

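Affiliations: Beijing Institute of Technology; Institute of Automation, Chinese Academy of Sciences; Beihang University; Beijing Jiaotong University; The University of Queensland; University of Macau; Nanyang Technological University; School of Artificial Intelligence (School of Software), Yanshan University

AI summary: DriveFuture is a future-aware latent world model for autonomous driving. Its core idea is to condition the modeling of the current latent state on future world states, thereby explicitly learning planning-oriented foresight. During training, the method predicts and refines future latent states, which then serve as an explicit condition for a diffusion-based trajectory planner; it achieves leading results on several public benchmarks. The experiments suggest that conditioning current decisions on future states does more to improve driving intelligence than merely predicting future states.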

Comments 24 pages, 7 figures

English abstract

Existing latent world models for autonomous driving have opened a promising path toward future-aware driving intelligence. However, they typically treat future latent states as prediction targets or auxiliary signals, rather than directly conditioning trajectory planning. This can entangle current and future features in latent space. In this work, we propose DriveFuture, a future-aware latent world modeling framework for autonomous driving that explicitly learns planning-oriented foresight by conditioning the current latent state modeling process on future world states. Specifically, during training, the model first predicts future latent world states from the current latent state and ego action, and then refines the prediction against the ground-truth future latent state via cross-attention. The resulting future-aware latent serves as an explicit condition for a diffusion-based trajectory planner. During inference, DriveFuture conditions on the predicted future latent state instead of the ground-truth future state. DriveFuture achieves SOTA performance on the public NAVSIM benchmarks, reaching 55.5 EPDMS on NAVSIM-v2 navhard, 89.9 EPDMS on NAVSIM-v2 navtest, and 90.7 PDMS on NAVSIM-v1 navtest, respectively. These results suggest that the key to latent world modeling lies not merely in simulating future states, but more importantly in conditioning current decision-making on future states. Notably, as of April 2026, DriveFuture ranks 1st on the NAVSIM-v2 navhard leaderboard (https://huggingface.co/spaces/AGC2025/e2e-driving-navhard) and achieves SOTA performance on NAVSIM-v1 navtest (https://huggingface.co/spaces/AGC2024-P/e2e-driving-navtest).

2605.09698 2026-05-12 cs.AI

Ambig-DS: A Benchmark for Task-Framing Ambiguity in Data-Science Agents

Josefa Lia Stoisser, Marc Boubnovski Martell, Sidsel Boldsen, Kaspar Märtens, Robert Kitchen

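Affiliations: Novo Nordisk, London, UK

AI summary: As data-science agents shift from assistive tools to autonomous systems, silent misframing of the task becomes a critical failure mode. This paper presents Ambig-DS, a benchmark for evaluating data-science agents when the prediction target or the evaluation objective is ambiguous; it consists of two diagnostic suites, built on DSBench and MLE-bench respectively. The study finds that failures are mostly wrong submissions made without clarifying the task rather than execution errors, and that performance recovers substantially when agents are allowed to ask a question. Agents nonetheless struggle to judge when to ask, which points to a gap in current evaluations: they overlook the ability to recognize task framing.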

English abstract

As data-science agents shift from co-pilots to auto-pilots, silent misframing becomes a critical failure mode. Agents quietly commit to plausible but unintended task framings, producing clean, executable artifacts that hide their incorrect assessment of the task. Existing benchmarks score whether the pipeline runs, ignoring whether the agent recognized the task was underspecified. We introduce Ambig-DS, two diagnostic suites: one for prediction-target ambiguity (Ambig-DS-Target, 51 tasks built on DSBench, a tabular modeling benchmark) and one for evaluation-objective ambiguity (Ambig-DS-Objective, 61 tasks built on MLE-bench, a Kaggle-style ML competition benchmark), constructed so that scoring uses each source benchmark's original evaluator. For every task we pair the original, fully specified version with an ambiguous variant produced by controlled edits; a human-and-LLM verification pipeline confirms each variant admits multiple plausible interpretations with decision-relevant consequences. The suites are analyzed independently and ambiguity lowers performance in both. Across five agents spanning efficient to frontier-class models, we find in our controlled diagnostic setting: (i) failures are silent commitments: wrong-target submissions on Target, wrong-metric or non-committal baseline submissions on Objective, rather than execution errors; (ii) allowing the agent to ask one clarifying question recovers much of the loss under idealized conditions, suggesting missing framing information drives a substantial part of the observed degradation; but (iii) agents cannot reliably tell when to use it: permissive prompts induce over-asking on clear tasks, while conservative prompts induce silent defaulting on ambiguous ones. Recognizing target and objective underspecification, not pipeline execution, is the bottleneck missing from standard DS-agent evaluations.

2605.09696 2026-05-12 cs.LG cs.NE cs.SC

Discovery of Nonlinear Dynamics with Automated Basis Function Generation

Mohammad Amin Basiri, Charles Nicholson

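Affiliations: Data Science and Analytics Institute, University of Oklahoma, Norman, OK, USA; School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USA

AI summary: Discovering governing equations from observational data is a fundamental challenge in scientific modeling, especially when the system's underlying mathematical structure is unknown. This paper proposes AutoSINDy, a hybrid framework that combines the exploratory power of symbolic regression with the sparsity-promoting capability of SINDy, using staged automatic generation and filtering of basis functions to improve the accuracy and robustness of model discovery. Experiments show that the method recovers ground-truth equations even under high noise and markedly outperforms conventional methods.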

Comments 53 pages, 17 figures. Code available at https://github.com/mabasiri95/AutoSINDy

English abstract

Discovering governing equations from observational data remains a fundamental challenge in scientific modeling, particularly when the underlying mathematical structure is unknown. Traditional sparse identification methods like SINDy excel at discovering parsimonious models but require researchers to specify candidate basis functions a priori, a limitation that often leads to model failure when critical terms are omitted or when systems exhibit unconventional dynamics. Purely symbolic regression approaches offer unlimited flexibility but struggle with noise sensitivity and frequently produce overly complex, unstable equations. We present AutoSINDy, a hybrid Discovery-then-Solve framework that combines the exploratory power of symbolic regression with the robust sparsity-promoting capabilities of SINDy. Our method operates in three stages: (1) PySR-based symbolic regression discovers candidate functional forms from bootstrapped data chunks; (2) a curation pipeline decomposes, expands, and filters these expressions using collinearity analysis to construct a minimal yet comprehensive library; and (3) SINDy identifies sparse governing equations from this custom-tailored library. Extensive experiments across canonical nonlinear systems demonstrate that AutoSINDy consistently recovers ground-truth equations even under high observational noise, achieving a ground-truth recovery rate of 92.8% across all trials. Compared with standard SINDy using enriched libraries and standalone symbolic regression, AutoSINDy achieves higher predictive accuracy, superior generalization to unseen trajectories, and substantially lower symbolic complexity.
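Stage (3) of the pipeline above relies on SINDy's sparse regression, which is commonly implemented as sequentially thresholded least squares (STLSQ). The sketch below illustrates only that step in plain Python, under clear assumptions: the candidate library is fixed by hand to `[1, x, x^2]` rather than discovered by PySR, the derivative data are noise-free, and `solve_linear`, `least_squares`, and `stlsq` are hypothetical helper names, not the AutoSINDy code.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def least_squares(theta, y, active):
    """Fit y on the active library columns via the normal equations."""
    cols = [[row[j] for row in theta] for j in active]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve_linear(gram, rhs)


def stlsq(theta, y, threshold=0.1, max_iter=10):
    """Alternate least-squares fits with pruning of small coefficients."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coef = [0.0] * n_terms
    for _ in range(max_iter):
        solution = least_squares(theta, y, active)
        coef = [0.0] * n_terms
        for j, c in zip(active, solution):
            coef[j] = c
        kept = [j for j in active if abs(coef[j]) >= threshold]
        if kept == active or not kept:
            break
        active = kept
    return [c if abs(c) >= threshold else 0.0 for c in coef]


# Toy system dx/dt = -2x, sampled without noise (an assumption for clarity).
xs = [0.1 * k for k in range(1, 21)]
library = [[1.0, x, x * x] for x in xs]
dxdt = [-2.0 * x for x in xs]
print(stlsq(library, dxdt))
```

The thresholding is what enforces parsimony: terms whose fitted coefficients fall below the threshold are removed and the remaining terms are refit. The quality of the result therefore depends on the library containing the right terms, which is the gap the abstract says the discovery and curation stages are meant to close.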

2605.09693 2026-05-12 cs.CV cs.AI cs.LG

Do multimodal models imagine electric sheep?

Santhosh Kumar Ramakrishnan, Carl Vondrick, Raja Giryes, Philipp Krähenbühl, Vladlen Koltun

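Affiliations: Apple

AI summary: This study asks whether multimodal models form mental imagery when solving spatial puzzles. It finds that large multimodal models do develop an imagination-like process on tasks such as jigsaw and tangram puzzles, and that they even "imagine" sheep when solving sheep-related puzzles. By fine-tuning a Qwen3.5 vision-language model on a range of visual reasoning tasks, the authors find that the model spontaneously forms visual representations of intermediate states while executing actions. Building on this finding, they propose two ways to sharpen and use the model's internal visual representations, which markedly improve the solve rate.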

English abstract

Yes. We find that large multimodal models develop mental imagery when solving spatial puzzles, and they do imagine sheep when solving sheep puzzles. We fine-tune a Qwen3.5 VLM to solve twelve diverse visual reasoning tasks -- including tangram, jigsaw, sokoban, 3D mental rotation, and rush hour -- that require understanding geometry, spatial relationships, and the consequences of actions. By supervising the model to predict the open-loop sequence of actions to solve a puzzle from an initial state, we show that the model's activations after each action encode meaningful visual information about the intermediate state. This finding suggests that an imperfect visual world model begins to form as a byproduct of learning to select correct actions, in the absence of any explicit visual supervision. Building on this observation, we propose two ways to sharpen and use the mental images formed by the model. We find that integrating as few as sixteen visual tokens per step into the chain of thought improves the average solve rate from 83% to 89%, with particularly strong gains on reasoning-heavy tasks such as jigsaw and 3D mental rotation.

2605.09691 2026-05-12 cs.LG

Quantum Circuit Simulation of Compartmental Drug Dynamics: Leveraging Variational Algorithms for Nonlinear Mixed-Effects Population Pharmacokinetics

Isshaan Singh, Nandan Patel

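Affiliations: School of Computer Science and Engineering; Vellore Institute of Technology

AI summary: This paper reformulates a conventional pharmacokinetic/pharmacodynamic (PK/PD) model as an open quantum system and simulates it with quantum circuits, aiming to improve the statistical performance of population pharmacokinetic modeling. Four pharmacological compartments are encoded in twelve qubits, and controlled quantum operations emulate the stochastic transitions between compartments. Experiments show that the quantum approach attains better log-likelihood values than the classical approach while yielding the same parameter estimates, which the authors take as validation of statistical fit and numerical stability, offering a hybrid quantum-classical modeling approach for biomedicine.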

English abstract

Population pharmacokinetic/pharmacodynamic (PK/PD) modeling traditionally relies on classical ordinary differential equations to simulate drug dynamics. In this work, we reformulate a compartmental PK/PD model as an open quantum system and implement it using quantum circuits developed in PennyLane. Four pharmacological compartments (central, peripheral, effect-site, and response) are encoded using twelve qubits, with inter-compartmental transitions represented through controlled quantum operations that emulate stochastic dynamics. The framework is evaluated on Phase 1 clinical data using a quantum-enhanced stochastic approximation expectation-maximization (SAEM) approach. Compared with the classical implementation, the quantum model achieves substantially improved log-likelihood values, indicating stronger statistical fit while preserving identical parameter estimates, thereby validating numerical consistency and model interpretability. The quantum-based optimization converges faster in terms of iterations, although total runtime is increased due to current simulation overhead. The study demonstrates stable large-scale simulation performance and establishes a hybrid quantum-classical approach that maintains biological fidelity while improving statistical modeling capacity. The dataset and problem statement originate from the Quantum Innovation Challenge 2025, and additional details are provided via the associated link.

2605.09688 2026-05-12 cs.CV

ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

Rui Song, Tianhui Cai, Markus Gross, Xingcheng Zhou, Zewei Zhou, Zhiyu Huang, Olaf Wysocki, Jiaqi Ma

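Affiliations: University of California, Los Angeles; University of Cambridge; Technical University of Munich

AI summary: This paper proposes ConFixGS, a method for repairing feedforward 3D Gaussian Splatting (3DGS) reconstructions of driving scenes. It uses confidence-aware diffusion priors: it generates local pseudo-targets and cross-checks them by reprojection against support views, enhancing reliable detail while suppressing inconsistent evidence. Experiments show that ConFixGS markedly improves novel view synthesis on several datasets, with PSNR gains of up to 3.68 dB and FID reduced by nearly half, demonstrating its effectiveness for robust reconstruction of driving scenes.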

Comments 28 pages, 12 figures

English abstract

Feedforward 3D Gaussian Splatting (3DGS) often struggles in trajectory-based sparse-view driving scenes. Existing Gaussian repair methods mainly target optimization-based 3DGS, while diffusion-based repair is typically restricted to iterative refinement near observed viewpoints, leaving feedforward 3DGS repair underexplored. We propose ConFixGS, a plug-and-play method that learns to fix feedforward 3DGS with confidence-aware diffusion priors. Starting from a pretrained feedforward model, ConFixGS generates diffusion-enhanced local pseudo-targets and validates them through reprojection-based cross-checking against support views. The resulting dense confidence maps guide refinement, enhancing reliable details while suppressing hallucinated or inconsistent evidence. On Waymo, nuScenes, and KITTI, ConFixGS improves challenging novel view synthesis, with PSNR gains of up to 3.68 dB and FID reduced by nearly half. Our results highlight confidence-aware fusion of generative priors and support-view consistency as a key principle for robust feedforward 3D driving scene reconstruction.
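To put the reported PSNR gain in perspective, the sketch below uses the standard definition of PSNR and converts a dB gain into the equivalent reduction in mean squared error. The function names `psnr` and `mse_ratio_for_gain` are hypothetical, and the pixel values are toy numbers, not data from the paper.

```python
import math


def psnr(reference, test, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


# Toy flattened "images" with intensities in [0, 1].
reference = [0.0, 0.0, 0.0, 0.0]
noisy = [0.1, 0.1, 0.1, 0.1]
print(psnr(reference, noisy))
print(mse_ratio_for_gain(3.68))
```

Because PSNR is logarithmic in the error, a gain of 3.68 dB corresponds to the mean squared error shrinking by a factor of roughly 2.3, all else being equal.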

2605.09687 2026-05-12 cs.CV

Spatial-Frequency Gated Swin Transformer for Remote Sensing Single-Image Super-Resolution

Md Aminur Hossain, Parekh Valkesh, Ayush V. Patel, Yogesh Jethani, Sanjay K. Singh, Biplab Banerjee

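Affiliations: Space Applications Centre, ISRO, Ahmedabad, India; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay, India; New L J Institute of Engineering and Technology, Ahmedabad, India; Pandit Deendayal Energy University, Gandhinagar, India; GLS University, Ahmedabad, India

AI summary: This paper addresses single-image super-resolution for remote sensing, which aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structure. To address the limited detail reconstruction of existing Swin Transformer models, the authors propose the Spatial-Frequency Gated Swin Transformer (SFG-SwinSR), which adds a spatial-frequency gating module to the feed-forward network to separate low-frequency structural content from high-frequency residual detail. Experiments show improved PSNR and SSIM on several remote sensing datasets, with better detail in the reconstructed high-resolution images.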

Comments 15 pages

English abstract

Remote Sensing (RS) single-image super-resolution aims to reconstruct high-resolution imagery from low-resolution observations while preserving fine spatial structures. Recent Swin Transformer-based models, including Swin2SR, provide strong spatial context modeling through shifted-window self-attention, but their feed-forward networks remain generic channel-mixing modules and do not separate low-frequency structural content from high-frequency residual detail. To address this limitation, we propose SFG-SwinSR, a Spatial-Frequency Gated Swin Transformer for single-image super-resolution in remote sensing. SFG-SwinSR modifies the original Swin2SR attention block by replacing each transformer block's standard feed-forward network with a lightweight Spatial-Frequency Gated Feed-Forward Network (SFG-FFN). The module estimates low-frequency content via a depthwise-blur branch, extracts high-frequency residuals by subtraction, refines them with a lightweight spatial branch, and adaptively injects detail through a bottleneck gate. Experiments on SpaceNet and SEN2VENμS show that SFG-SwinSR improves reconstruction quality under the evaluated settings. On SpaceNet, it achieves 45.19 dB PSNR and 0.9852 SSIM, indicating effective enhancement of high-frequency details. This demonstrates that spatial-frequency transformation within the transformer feed-forward network improves detail reconstruction in RS super-resolution.
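The low/high-frequency split described for SFG-FFN can be illustrated on a one-dimensional signal. The sketch below is a heavily simplified toy, not the SFG-FFN module. A fixed 3-tap box filter stands in for the learned depthwise blur, a single scalar sigmoid gate stands in for the bottleneck gate, the spatial refinement branch is omitted, and the residual-style injection `x + gate * high` is an assumed form. All names (`box_blur`, `sfg_sketch`) are hypothetical.

```python
import math


def box_blur(signal):
    # 3-tap box filter with replicate padding: stand-in for the depthwise blur branch.
    n = len(signal)
    return [
        (signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, n - 1)]) / 3.0
        for i in range(n)
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sfg_sketch(signal, gate_logit=0.0):
    low = box_blur(signal)                          # low-frequency estimate
    high = [s - l for s, l in zip(signal, low)]     # high-frequency residual by subtraction
    gate = sigmoid(gate_logit)                      # scalar gate in (0, 1)
    out = [s + gate * h for s, h in zip(signal, high)]  # gated detail injection (assumed form)
    return out, low, high


step_edge = [0.0, 0.0, 1.0, 1.0]
out, low, high = sfg_sketch(step_edge)
print(low)
print(high)
print(out)
```

On a flat signal the residual is zero and the output equals the input; around an edge the residual is nonzero, so the gate controls how much edge detail is added back. In the actual model these operations would act on feature maps with learned parameters.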

2605.09685 2026-05-12 cs.LG cs.AI

Learning Unified Representations of Normalcy for Time Series Anomaly Detection

Prithul Sarker, Sushmita Sarker, Nicholas G. Murray, Alireza Tavakkoli

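Affiliations: University of Nevada, Reno

AI summary: This paper studies a core problem in unsupervised time series anomaly detection: learning a robust representation that characterizes the normal data distribution without prior knowledge of what anomalies look like. The authors propose $\text{U}^2\text{AD}$, a unified unsupervised anomaly detection framework that learns the underlying distribution of normal data with score-based generative modeling and introduces a time-dependent score network and a unified training objective to capture both local and global temporal context. Experiments show that the method outperforms existing state-of-the-art methods in detection accuracy and in early identification of anomalies.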

English abstract

The core challenge in unsupervised anomaly detection is identifying abnormal patterns without prior knowledge of their characteristics. While existing methods have addressed aspects of this problem, they often struggle to learn a robust representation of the normal data distribution that is distinct from anomalous patterns. In this paper, we present a novel framework, Unified Unsupervised Anomaly Detection ($\text{U}^2\text{AD}$), that comprehensively addresses anomaly detection in multivariate time series. Our approach learns the underlying data distribution of normal samples by utilizing score-based generative modeling. We introduce a novel time-dependent score network and a unified training objective that together delineate the manifold of normal data while considering both local and global temporal contexts. Reconstruction is then performed via a deterministic sampling process using an ordinary differential equation solver. Our extensive experimental evaluations demonstrate that $\text{U}^2\text{AD}$ not only outperforms current state-of-the-art methods in detection accuracy but also identifies anomalies at significantly earlier stages of their occurrence.

2605.09681 2026-05-12 cs.CV

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Yicheng Ji, Zhizhou Zhong, Jun Zhang, Qin Yang, XiTai Jin, Ying Qin, Wenhan Luo, Shuiyang Mao, Wei Liu, Huan Li

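Affiliations: ZJU; Video Rebirth; HKUST; BJTU

AI summary: This paper targets the high attention complexity and heavy memory overhead caused by redundant key-value (KV) caches in autoregressive video diffusion models, and proposes Forcing-KV, a hybrid KV cache compression method. Based on an analysis of the functional roles of attention heads in mainstream models, heads are divided into static heads, which focus on intra-frame fidelity and transitions across chunks, and dynamic heads, which govern inter-frame motion and consistency; the former receive structured pruning and the latter dynamic pruning based on segment-wise similarity. While maintaining generation quality, the method substantially speeds up generation and reduces memory use, reaching over 29 frames per second on a single NVIDIA H200 GPU.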

Comments 10 pages

English abstract

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.

2605.09679 2026-05-12 cs.CV cs.AI

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen, Wenjie Xiao, Pedro R. A. S. Bassi, Boyan Wang, Liang He, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille

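Affiliations: Johns Hopkins University; University of Bologna; Istanbul Medipol University; Center for Biomolecular Nanotechnologies, Istituto Italiano di Tecnologia; The First Affiliated Hospital, Sun Yat-Sen University; Tongji University

AI summary: DeepTumorVQA is a hierarchical 3D CT benchmark for medical imaging, designed for stage-wise evaluation of medical vision-language models (VLMs) and tool-augmented agents. It decomposes the reasoning involved in tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) so that model performance at each level can be assessed independently. It also provides tool-interaction environments in which a model can call external tools for segmentation, measurement, and medical knowledge, bringing the evaluation closer to clinical practice. Experiments show that tool augmentation substantially improves performance on complex medical reasoning tasks.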

English abstract

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.

2605.09678 2026-05-12 cs.AI

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

Ryan Albright, Golam Md Muktadir, Zarif Ikram, S M Jubaer, Mehrab Hossain, Dianbo Liu

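Affiliations: The Nueva School; University of Southern California; Notre Dame College; Arizona State University; National University of Singapore

AI summary: This paper proposes Absurd World, a benchmarking framework for testing the logical reasoning of large language models (LLMs). It decomposes real-world problems into symbols, actions, sequences, and events, and automatically alters these elements to build scenarios that are logically coherent but absurd. With the logic of the task unchanged, this tests whether an LLM can reason while ignoring real-world patterns. Experiments indicate that Absurd World is an effective tool for assessing the robustness of LLM logical reasoning.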

English abstract

While extremely powerful and versatile at various tasks, the thinking capabilities of large language models (LLMs) are often put under scrutiny as they sometimes fail to solve problems that humans can systematically solve. However, recent literature focuses on breaking LLM reasoning with increasingly complex problems, and whether an LLM is robust in simple logical reasoning remains underexplored. This paper proposes Absurd World, a benchmarking framework, to test LLMs against altered realism, where scenarios are logically coherent, and humans can easily solve the tasks. Absurd World breaks a real-world model into symbols, actions, sequences, and events, which are automatically altered to create absurd worlds where the logic to solve the tasks remains the same. It evaluates a large collection of models with simple and advanced prompting techniques, and proves that it is an effective tool to determine LLMs' ability to think logically, ignoring the patterns learned from the real world. One can use this framework to extensively test an LLM against a real-world problem to verify whether the LLM's reasoning capability is robust against variations of the task.

2605.09677 2026-05-12 cs.CV

VFM-SDM: A vision foundation model-based framework for training-free, marker-free, and calibration-free structural displacement measurement

Qingyu Xian, Hao Cheng, Berend Jan van der Zwaag, Rolands Kromanis, Ozlem Durmaz Incel

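Affiliations: Pervasive Systems Research Group, Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente; Department of Earth Observation Science, Faculty of Geo-Information Science and Earth Observation (ITC), University of Twente; Department of Civil Engineering and Management, Faculty of Engineering Technology, University of Twente

AI summary: This paper proposes VFM-SDM, a structural displacement measurement framework based on vision foundation models (VFMs) that performs non-contact, multi-directional displacement measurement without task-specific training, on-site markers, or calibration. It combines VFM-inferred camera parameter estimation with point tracking, reconstructs displacements by triangulation, and adds structural geometry constraints to improve the physical plausibility and consistency of the estimates. Experiments show high accuracy and stability in a real-world setting, pointing toward automated, scalable structural health monitoring.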

English abstract

Reliable displacement measurement is fundamental for structural health monitoring and digital engineering workflows, as it provides direct structural response information. Vision-based measurement has emerged as a promising approach for low-cost, non-contact displacement monitoring. However, its deployment often remains constrained by task-specific model training or on-site preparation, such as marker installation or manual camera calibration. This study presents a Vision Foundation Model-based framework for Structural Displacement Measurement (VFM-SDM) that integrates VFM-inferred camera parameter estimation and point tracking to reconstruct multi-directional structural displacements via triangulation without task-specific training or on-site preparation, enabling efficient non-contact deployment in real-world applications. Structural geometry constraints are incorporated to suppress physically implausible deviations and improve estimation consistency. A multi-modal field dataset collected from an in-service pedestrian bridge is introduced alongside a unified benchmarking protocol to support reproducible evaluation. Representative results show low amplitude errors (NRMSE$_{\text{range}}$: 0.11/0.12), strong temporal agreement (correlation coefficient: 0.86/0.88), and small peak-to-peak amplitude errors (RPPAE: 0.01/0.02) for vertical and lateral displacements, indicating robust performance under real-world conditions. The proposed framework advances automated, scalable displacement monitoring and lays the groundwork for VFM-enabled structural response measurements in digital twin and data-centric construction workflows.
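The three reported error measures can be sketched for a pair of displacement time series. The definitions below are assumptions chosen to match the metric names, since the abstract does not spell them out: RMSE normalized by the reference range, the Pearson correlation coefficient, and a relative peak-to-peak amplitude error. The function names and toy series are hypothetical.

```python
import math


def nrmse_range(reference, estimate):
    """RMSE divided by the reference signal's range (assumed definition)."""
    rmse = math.sqrt(
        sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    )
    return rmse / (max(reference) - min(reference))


def pearson(reference, estimate):
    """Pearson correlation coefficient between the two series."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    cov = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return cov / math.sqrt(var_r * var_e)


def rppae(reference, estimate):
    """Relative peak-to-peak amplitude error (assumed definition)."""
    pp_ref = max(reference) - min(reference)
    pp_est = max(estimate) - min(estimate)
    return abs(pp_est - pp_ref) / pp_ref


reference = [0.0, 1.0, 0.0, -1.0]
scaled = [2.0 * v for v in reference]
print(nrmse_range(reference, scaled), pearson(reference, scaled), rppae(reference, scaled))
```

The toy case shows why the measures are reported together: an estimate that is a scaled copy of the reference is perfectly correlated with it, yet has large amplitude errors. Correlation captures temporal agreement while the other two capture magnitude.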

2605.09676 2026-05-12 cs.LG cs.AI nlin.CD

ChaosNetBench: Benchmarking Spatio-Temporal Graph Neural Networks on Chaotic Lattice Dynamics

Henok Tenaw Moges, Charalampos Skokos, Deshendran Moodley

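Affiliations: Centre for Artificial Intelligence Research (CAIR); University of Cape Town; Nonlinear Dynamics and Chaos Group; Department of Mathematics and Applied Mathematics

AI summary: This paper presents ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for assessing spatio-temporal graph neural networks (STGNNs) under controlled multidimensional chaotic dynamics. CNB is built on a lattice of coupled standard maps whose local chaos, coupling strength, and system size can be tuned independently, and provides known topology and dynamics for 96 system instances and 9,600 trajectories. The study introduces chaos indicators and an evaluation protocol; comparing 13 architectures, it finds that non-graph baselines remain competitive under low local chaos, while STGNNs are generally more resilient at higher levels of local and global chaos.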

Comments 24 pages, 11 figures

详情
英文摘要

Spatio-temporal graph neural networks (STGNNs) are widely used for short-term forecasting in dynamic physical systems such as traffic and weather. However, the prevailing evaluation practice uses real-world benchmark datasets in a single domain with a single fixed holdout split, making it difficult to compare architectures across different dynamical regimes. We introduce ChaosNetBench (CNB), a synthetic benchmark dataset and evaluation framework for studying STGNN performance under controlled multidimensional chaotic dynamics. CNB is built on a lattice of coupled standard maps with independently tunable local chaos ($K$), coupling strength ($\varepsilon$), and system size ($N$), providing known topology and known dynamics across 96 system instances and 9,600 trajectories. We introduce chaos indicators, evaluation metrics and a protocol to analyze and compare the capacity of STGNN architectures to deal with different levels of local and global chaos. We illustrate the usage of the framework by analyzing 13 architectures (5 STGNNs and 8 non-graph baselines). The results reveal a regime-dependent transition in which non-graph baselines (TCN, N-BEATS, iTransformer) remain competitive when there is low local chaos, while STGNNs (e.g., Graph WaveNet, D2STGNN, STAEformer) are generally more resilient to higher levels of local and global chaos. CNB provides a practical, reusable testbed for systematically comparing and analyzing the capacity of STGNN architectures to handle different levels of local and global chaos.
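A lattice of coupled standard maps with the three knobs named above ($K$, $\varepsilon$, $N$) can be sketched as follows. The update rule is one common ring-lattice form, chosen here as an assumption; the benchmark's exact equations, boundary conditions, and sampling of initial conditions may differ. The names `step` and `trajectory` are hypothetical.

```python
import math
import random


def step(x, y, K, eps):
    """One update of a ring of coupled standard maps (assumed form, values mod 1)."""
    n = len(x)
    two_pi = 2.0 * math.pi
    y_new = []
    for j in range(n):
        left, right = x[(j - 1) % n], x[(j + 1) % n]
        # Local kick: its strength K controls local chaos.
        kick = (K / two_pi) * math.sin(two_pi * x[j])
        # Nearest-neighbour coupling with strength eps.
        coupling = (eps / two_pi) * (
            math.sin(two_pi * (right - x[j])) + math.sin(two_pi * (left - x[j]))
        )
        y_new.append((y[j] + kick - coupling) % 1.0)
    x_new = [(x[j] + y_new[j]) % 1.0 for j in range(n)]
    return x_new, y_new


def trajectory(n_sites, K, eps, n_steps, seed=0):
    """Iterate the lattice from a seeded random initial condition."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n_sites)]
    y = [rng.random() for _ in range(n_sites)]
    states = [(x, y)]
    for _ in range(n_steps):
        x, y = step(x, y, K, eps)
        states.append((x, y))
    return states


states = trajectory(n_sites=8, K=1.0, eps=0.1, n_steps=20)
print(len(states), len(states[0][0]))
```

Because the lattice topology and the map are fully specified, the ground-truth graph and dynamics are known by construction, which is what makes such a system usable as a controlled testbed. For scale, 9,600 trajectories over 96 system instances works out to 100 trajectories per instance on average.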

2605.09675 2026-05-12 cs.AI cs.MA

CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents

Timothy Ossowski, Xinchi Liu, Danyal Maqbool, Vaibhav Dhanuka, Sheng Zhang, Hoifung Poon, Majid Afshar, Tyler Bradshaw, Junjie Hu

Affiliations * University of Wisconsin–Madison; Microsoft Research

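AI summary This paper presents CodeClinic, a benchmark built on MIMIC-IV for evaluating whether large language model agents can synthesize and compose reusable clinical skills for clinical reasoning tasks rather than relying on fixed toolboxes. The benchmark contains two complementary tasks, longitudinal ICU surveillance and compositional information seeking, which assess structured decision-making and multi-step reasoning respectively. The authors also propose an offline autoformalization pipeline that converts natural-language clinical guidelines into verified Python skill libraries through iterative refinement, improving consistency and reducing per-query token usage by up to 40%.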

English abstract

Clinical reasoning agents based on large language models (LLMs) aim to automate tasks such as intensive care unit (ICU) monitoring and patient state tracking from electronic health records (EHRs). Existing systems typically rely on manually curated clinical tools or skills for concepts such as sepsis detection and organ failure assessment. However, maintaining these tool libraries requires substantial expert effort, while zero-shot querying or code generation often produces inefficient and unreliable reasoning chains, especially under institution-specific clinical policies. We introduce CodeClinic, a benchmark built on MIMIC-IV for evaluating whether LLM agents can synthesize and compose reusable clinical skills instead of relying on fixed toolboxes. The benchmark contains two complementary tasks: longitudinal ICU surveillance and compositional information seeking. The longitudinal setting simulates monitoring patient trajectories with structured decisions every four hours across 25 findings and eight clinical families, while the compositional setting spans 63k instances across 259 tasks in nine domains and is stratified by compositional dependency depth to evaluate increasingly complex multi-step reasoning. We further propose an offline autoformalization pipeline that converts natural-language clinical guidelines into reusable and verified Python skill libraries through iterative LLM refinement. Compared with zero-shot code generation, the resulting libraries improve consistency while reducing per-query token usage by up to 40%.

2605.09672 2026-05-12 cs.RO

MVB-Grasp: Minimum-Volume-Box Filtering of Diffusion-based Grasps for Frontal Manipulation

Bibek Poudel, Abdul Basit, Muhammad Shafique

Affiliations * Unitree; Intel

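AI summary This paper addresses frontal grasping on low-cost robot arms with constrained workspaces and proposes MVB-Grasp, a grasp filtering method based on the minimum-volume bounding box (MVBB). The method injects a geometric prior, uses the face normals of the oriented bounding box for fast filtering, and re-scores grasp candidates by blending learned discriminator scores with face-alignment geometry. Across 81 MuJoCo episodes with the Unitree Z1 arm, MVB-Grasp achieves 59.3% success versus 24.7% for vanilla GraspGen, a 2.4x improvement.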

Comments 8 pages, 12 figures, accepted to IJCNN 2026

English abstract

State-of-the-art 6-DoF grasp generators excel on tabletop benchmarks with overhead cameras but struggle in frontal grasping scenarios on low-cost manipulators with constrained workspaces, where kinematic limits and approach-direction constraints cause high failure rates. We address this challenge for the Unitree Z1 arm by proposing MVB-Grasp, a novel grasping stack that injects a Minimum Volume Bounding Box (MVBB) geometric prior into diffusion-based grasp generation to dramatically improve success rates in frontal, workspace-constrained settings. Our key scientific contributions are threefold: (i) an MVBB-based geometric filter that exploits oriented bounding-box face normals to reject grasps approaching through the table or misaligned with accessible object faces in $O(N)$ time; (ii) a combined re-scoring function that blends learned discriminator scores with face-alignment geometry ($\alpha = 0.85$), specifically calibrated for the Z1's frontal workspace and kinematic constraints; and (iii) a systematic MuJoCo evaluation protocol measuring grasp success across object types, distances, lateral positions, and pitch orientations to validate embodiment-specific performance. We implement MVB-Grasp on a Unitree Z1 arm with an Intel RealSense D405 camera, integrating YOLOv8 object detection, GraspGen for candidate generation, Principal Component Analysis (PCA)-based MVBB fitting, and inverse-kinematics trajectory planning. Experiments across 81 MuJoCo episodes (cylinder, asymmetric box, water bottle) demonstrate that MVB-Grasp achieves 59.3% success versus 24.7% for vanilla GraspGen, a 2.4x improvement, by filtering geometrically infeasible candidates and prioritizing face-aligned grasps suited to the Z1's frontal approach constraints. Real-world trials confirm that the MVBB prior substantially improves grasp reliability on constrained, low-cost manipulators without requiring model retraining.
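
The filter-then-rescore idea can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the abstract does not say which term $\alpha$ weights, so the blend `ALPHA * discriminator + (1 - ALPHA) * alignment`, the table-normal feasibility test, and the function names are all hypothetical.

```python
import numpy as np

ALPHA = 0.85  # value quoted in the abstract; which term it weights is assumed

def pca_box_axes(points):
    """Return the centroid and three orthonormal box axes (rows) from PCA."""
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)
    _, vecs = np.linalg.eigh(cov)  # columns of vecs are eigenvectors
    return center, vecs.T

def rescore(approach_dirs, disc_scores, axes, table_normal=(0.0, 0.0, 1.0)):
    """Reject grasps that come up through the table, then blend the scores."""
    norms = np.linalg.norm(approach_dirs, axis=1, keepdims=True)
    approach = approach_dirs / norms
    # Box face normals are +/- each axis, so the largest |dot| is the alignment.
    alignment = np.abs(approach @ axes.T).max(axis=1)
    # A grasp moving along +table_normal would have to pass through the table.
    feasible = approach @ np.asarray(table_normal) <= 0.0
    combined = ALPHA * disc_scores + (1.0 - ALPHA) * alignment
    return np.where(feasible, combined, -np.inf)
```

Both steps are single passes over the N candidates, which matches the linear-time filter the abstract describes.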

2605.09670 2026-05-12 cs.RO cs.CV

Towards Generative Predictive Display for Vision-Based Teleoperation: A Zero-Shot Benchmark of Off-the-Shelf Video Models

Aws Khalil, Jaerock Kwon

Affiliations * Department of Electrical and Computer Engineering, University of Michigan - Dearborn

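AI summary This paper studies generative predictive display for vision-based teleoperation, which aims to mitigate the effects of communication latency by generating future visual states. The authors present a zero-shot benchmark, with no task-specific fine-tuning, that evaluates several off-the-shelf generative video models on short-horizon predictive display. The experiments show that no tested model simultaneously meets the requirements on prediction accuracy, inference latency, and error stability, exposing a gap between general-purpose generative video models and predictive display for teleoperation.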

English abstract

Teleoperation systems are fundamentally limited by communication latency, which degrades situational awareness and control performance. Predictive display aims to mitigate this limitation by presenting an estimate of the current visual state rather than delayed observations. While recent advances in generative video models enable high-quality video synthesis, their suitability for latency-sensitive predictive display remains unclear. This paper presents a zero-shot benchmark of off-the-shelf generative video models for short-horizon predictive display, without task-specific fine-tuning. We formulate the problem as rollout-based future frame prediction and develop a unified benchmarking pipeline using simulated driving data from the CARLA simulator. Five publicly released video models spanning transformer-based and diffusion-based families are evaluated across two resolutions and two conditioning regimes (multi-frame and single-frame). Performance is assessed using prediction accuracy (mean absolute difference), per-rollout latency, peak GPU memory usage, and temporal error evolution across the prediction horizon. On this zero-shot benchmark, no tested model simultaneously achieves low rollout error, non-divergent per-step error behavior, and real-time inference at the source frame rate. Increasing model scale or resolution yields limited and, in some cases, inverted improvements. These findings highlight a gap between general-purpose generative video synthesis and the requirements of predictive display in teleoperation, suggesting that practical deployment will require either explicit short-horizon temporal supervision, in-domain adaptation, or aggressive inference optimization rather than direct application of off-the-shelf models. Code, configurations, and qualitative results are released on the project page: https://bimilab.github.io/paper-GenPD
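
Rollout-based prediction and the per-step error metric can be sketched as follows. The `persistence` model is a hypothetical stand-in (it simply repeats the last frame), and the function names are illustrative rather than taken from the released pipeline.

```python
import numpy as np

def rollout(predict_next, context, horizon):
    """Autoregressive rollout: each predicted frame is fed back as input."""
    frames = list(context)
    preds = []
    for _ in range(horizon):
        nxt = predict_next(frames)
        preds.append(nxt)
        frames.append(nxt)
    return np.stack(preds)

def per_step_mad(pred, target):
    """Mean absolute difference for each predicted step, averaged over pixels."""
    diff = np.abs(pred.astype(np.float64) - target.astype(np.float64))
    return diff.reshape(len(pred), -1).mean(axis=1)

def persistence(frames):
    """Hypothetical baseline model: repeat the most recent frame."""
    return frames[-1]
```

Plotting `per_step_mad` against the step index gives the temporal error evolution; a curve that keeps growing along the horizon is the divergent behavior the abstract refers to.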

2605.09667 2026-05-12 cs.CV cs.AI

S2P-Net: A Spectral-Spatial Polar Network for Rotation-Invariant Object Recognition in Low-Data Regimes

Albert Heruth

Affiliations * Unaffiliated Researcher

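AI summary This paper proposes S2P-Net, a compact deep learning architecture for rotation-invariant object recognition in low-data regimes that guarantees rotation invariance mathematically without data augmentation. The network combines spectral and spatial information and uses a polar transform to strengthen robustness to rotation. Compared with conventional convolutional neural networks, S2P-Net is reported to recognize objects better in few-shot settings, offering a new approach to rotation-invariant recognition under low-data conditions.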

Comments 9 pages, 4 figures, 3 tables. Preprint. Code available from the author upon request

English abstract

We present S2P-Net (Spectral-Spatial Polar Network), a compact deep learning architecture that achieves mathematically guaranteed rotation invariance without data augmentation. In this paper, we also compare it with other neural network architectures (CNNs). Have a look at the results and feel free to contact me with any questions. This is my first paper :) Made by Hackbert

2605.09666 2026-05-12 cs.CV cs.AI

Rethinking Evaluation of Multiple Sclerosis (MS) Lesion Segmentation Models

Abdul Basit, Ashir Rashid, Muhammad Abdullah Hanif, Muhammad Shafique

Affiliations * eBRAIN Lab, Division of Engineering, New York University (NYU) Abu Dhabi

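AI summary This paper examines shortcomings in how multiple sclerosis (MS) lesion segmentation models are evaluated, noting that most evaluations rely on the Dice score and fail to account for lesion-wise detection and segmentation performance or for model behavior on cases that are complex or hard for human annotators to judge. The authors analyze in detail what neurologists look for in brain MRI scans, propose evaluation metrics that better match practical needs, and analyze existing state-of-the-art models on two open-source datasets to assess their suitability for real clinical settings.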

Comments 8 pages, 5 figures, Accepted to IJCNN 2026

English abstract

Multiple Sclerosis (MS) is a chronic autoimmune disease that can significantly reduce the quality of life of a patient. Existing treatment options can only help slow down the progression of the disease. Therefore, early detection and precise monitoring of disease progression are important. Deep learning offers state-of-the-art models for detecting and segmenting MS lesions in brain MRI scans. However, most of these models are evaluated using the Dice score, without accounting for lesion-wise detection and segmentation performance or other metrics that quantify model performance in cases that are complex or confusing for human annotators, or in cases that are essential for disease detection and progression monitoring. In this paper, we highlight the need to rethink the evaluation of MS lesion segmentation models. In this context, we first present problem fingerprinting in detail to highlight what neurologists look for in brain MRI scans for MS detection and progression monitoring, and which metrics are required to properly quantify model performance in these contexts. Additionally, we present an analysis of state-of-the-art models on two open-source datasets using these metrics to highlight their usability for real-world deployment in hospitals.
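
A toy example shows why a voxel-level Dice score can hide a missed lesion. The lesion-wise recall below (a ground-truth lesion counts as detected if any predicted voxel overlaps it) is one common definition chosen for illustration; the abstract does not list the paper's exact metrics, and the 2D masks and function names are assumptions.

```python
def dice(pred, truth):
    """Voxel-level Dice score for two binary 2D masks (lists of 0/1 rows)."""
    inter = sum(p and t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 2.0 * inter / total if total else 1.0

def lesions(mask):
    """4-connected components of a binary mask, each as a set of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen, comps = set(), []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                stack, comp = [(r, c)], set()
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    comp.add((y, x))
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        inside = 0 <= ny < rows and 0 <= nx < cols
                        if inside and mask[ny][nx] and (ny, nx) not in seen:
                            seen.add((ny, nx))
                            stack.append((ny, nx))
                comps.append(comp)
    return comps

def lesion_recall(pred, truth):
    """Fraction of ground-truth lesions touched by at least one predicted voxel."""
    gt = lesions(truth)
    hit = sum(any(pred[y][x] for y, x in comp) for comp in gt)
    return hit / len(gt) if gt else 1.0
```

With one 16-voxel lesion segmented perfectly and one single-voxel lesion missed, Dice is 32/33 (about 0.97) while lesion-wise recall is only 0.5.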

2605.09665 2026-05-12 cs.LG cs.AI cs.CL

Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies

Jingze Song, Zihao Chen, Wenqing Chen, Zibin Zheng

Affiliations * School of Software Engineering, Sun Yat-sen University

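AI summary This paper studies how to select training data efficiently for instruction tuning and proposes a joint task-model adaptation framework that learns multi-indicator weights to optimize data selection. The method uses in-context learning signals on small validation sets to determine the best weight configuration without full-scale fine-tuning, giving efficient, high-fidelity data evaluation. Experiments across multiple benchmarks and model families show performance comparable to or better than full-dataset fine-tuning, and reveal a trade-off between semantic diversity and logical complexity in reasoning tasks.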

Comments This work has been accepted at IJCAI 2026

English abstract

Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
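
The selection step can be pictured as ranking samples by a weighted sum of indicator scores and searching over weight vectors with a cheap proxy. The sketch below assumes a simple linear combination and a plain search over candidate weights; `proxy_score` is a hypothetical callable standing in for the ICL-based signal on the tiny validation set, and none of these names come from the paper.

```python
def select_top_fraction(indicator_scores, weights, fraction=0.3):
    """Rank samples by a weighted sum of indicator scores; keep the top fraction."""
    combined = [sum(w * s for w, s in zip(weights, row)) for row in indicator_scores]
    keep = max(1, round(fraction * len(combined)))
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return sorted(order[:keep])  # indices of the selected samples

def pick_weights(candidates, indicator_scores, proxy_score, fraction=0.3):
    """Choose the weight vector whose selected subset maximizes the proxy score."""
    def evaluate(weights):
        subset = select_top_fraction(indicator_scores, weights, fraction)
        return proxy_score(subset)  # hypothetical proxy, e.g. an ICL accuracy
    return max(candidates, key=evaluate)
```

Because only the proxy is evaluated per candidate, no fine-tuning run is needed while the weights are being chosen.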

2605.09663 2026-05-12 cs.LG cs.AI

Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation

Julien Lafrance, Richard Khoury, Véronique Tremblay

Affiliations * Laval University

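AI summary In dynamic environments, machine learning classifiers often lose performance because of concept drift, and conventional evaluation methods struggle to reflect the causal dependencies in the data-generating process. This paper proposes Causal Parametric Drift Simulation, a digital twin framework based on structural causal models that exposes a classifier's latent vulnerabilities before deployment through precise causal interventions. Experiments show that the method uncovers hidden problems that standard statistical monitors cannot detect, providing a new tool for evaluating classifier robustness.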

Comments 34 pages, 13 figures, 14 tables

English abstract

Machine learning classifiers in dynamic environments face concept drift -- changes in the data-generating process that degrade performance. Conventional evaluation via static test sets or noise perturbations fails to preserve causal dependencies in tabular data, often producing causally invalid assessments. Post-hoc tools like SHAP and LIME offer correlational insights that may not reflect the causal mechanisms driving model failure. We propose a framework that complements existing drift detection by leveraging Structural Causal Models as "Digital Twins" of data-generating processes, enabling precise causal interventions while preserving structural dependencies. Our technique, Causal Parametric Drift Simulation, stress-tests classifiers to identify vulnerabilities before deployment. Experiments on the Open Sourcing Mental Illness (OSMH) dataset demonstrate that this approach exposes latent vulnerabilities invisible to standard statistical monitors.
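
A toy structural causal model illustrates how a parametric intervention can break a classifier while leaving every marginal distribution unchanged. The three-variable SCM, the coefficient `a`, and the frozen classifier below are invented for illustration and are not the paper's digital twin.

```python
import random

def sample_scm(n, a, noise=0.5, seed=0):
    """Sample from a toy SCM: x1 -> x2 -> y, with x2 = a * x1 + noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = a * x1 + rng.gauss(0.0, noise)
        y = 1 if x2 > 0 else 0
        rows.append((x1, x2, y))
    return rows

def frozen_classifier(x1, x2):
    """Hypothetical deployed model that learned to rely on x1 alone."""
    return 1 if x1 > 0 else 0

def accuracy(rows, classifier):
    """Fraction of rows where the classifier output matches the label."""
    return sum(classifier(x1, x2) == y for x1, x2, y in rows) / len(rows)
```

Intervening on the mechanism by flipping `a` from 2.0 to -2.0 keeps the marginals of `x1`, `x2`, and `y` the same, so a monitor that watches per-feature statistics sees nothing, yet the classifier's accuracy collapses.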

2605.09662 2026-05-12 cs.CV

BEA-GS: BEyond RAdiance Supervision in 3DGS for Precise Object Extraction

Alessio Mazzucchelli, Maria Naranjo-Almeida, Jorge Bustos-Sanchez, Mariella Dimiccoli, Francesc Moreno-Noguer, Jordi Sanchez-Riera, Adrian Penate-Sanchez

Affiliations * Arquimea Research Center; Institut de Robòtica i Informàtica Industrial (CSIC-UPC); Universidad de las Palmas de Gran Canaria (IUSIANI)

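AI summary This paper proposes BEA-GS, a new Gaussian splatting method that goes beyond radiance supervision to extract objects more precisely. The method introduces two new loss functions that adjust the geometry of visible and non-visible Gaussians, respectively, so that it respects semantic boundaries. Comparisons across multiple datasets show the best boundary segmentation to date, improving the precision of object-level editing and asset extraction.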

Comments CVPR 2026 Highlight

English abstract

Most Gaussian Splatting techniques that provide a 3D semantic representation of the scene do not optimize the underlying 3D geometry, making object-level editing or asset extraction challenging. Recent methods, such as COBGS, Trace3D, and ObjectGS, acknowledge this limitation and propose approaches that modify the scene's geometry to represent the underlying semantics. We advance this concept further by proposing a novel solution that provides near-perfect boundaries in object extraction. We do so by introducing two new losses in the optimization: 1) a loss that modifies the geometry of visible Gaussians to respect semantic boundaries, and 2) a loss that adjusts the geometry of non-visible Gaussians that appear once the object is extracted. Our first loss propagates gradients directly through the rasterization, allowing for seamless integration within the optimization of the Gaussian parameters. The second loss also propagates gradients to Gaussian parameters but does so without passing through the rasterization, enabling modification of the scene's geometry even when little transmittance reaches a Gaussian (partial or non-visible). Exhaustive comparisons with 12 state-of-the-art methods across 4 datasets, using 6 metrics, demonstrate that our approach produces overall the best boundary segmentation to date.

2605.09661 2026-05-12 cs.CL cs.AI

MedMeta: A Benchmark for LLMs in Synthesizing Meta-Analysis Conclusion from Medical Studies

Huy Hoang Ha, Benoit Favre, Francois Portet

Affiliations * GitHub

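AI summary This paper proposes MedMeta, a new benchmark for evaluating the ability of large language models (LLMs) to synthesize meta-analysis conclusions from the abstracts of medical studies. The benchmark contains 81 meta-analyses from PubMed and evaluates models through two workflows: retrieval-augmented generation with ground-truth abstracts (Golden-RAG) and a parametric-only approach that relies on internal knowledge. The study finds that Golden-RAG significantly outperforms the parametric-only approach, that domain-specific fine-tuning brings limited benefit, and that current models fail to identify and reject negated evidence, underscoring both the importance of RAG systems for clinical applications and the shortcomings of existing models.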

English abstract

Large language models (LLMs) have saturated standard medical benchmarks that test factual recall, yet their ability to perform higher-order reasoning, such as synthesizing evidence from multiple sources, remains critically under-explored. To address this gap, we introduce MedMeta, the first benchmark designed to evaluate an LLM's ability to generate conclusions from medical meta-analyses using only the abstracts of cited studies. MedMeta comprises 81 meta-analyses from PubMed (2018-2025) and evaluates models using two distinct workflows: a Retrieval-Augmented Generation (Golden-RAG) setting with ground-truth abstracts, and a Parametric-only approach relying on internal knowledge. Our evaluation framework is validated by an analysis showing that our LLM-as-a-judge protocol strongly aligns with human expert ratings, as evidenced by a high Pearson correlation ($r = 0.81$) and a Bland-Altman analysis revealing negligible systematic bias, establishing it as a reliable proxy for scalable evaluation. Our findings underscore the critical importance of information grounding: the Golden-RAG workflow consistently and significantly outperforms the Parametric-only approach across models. In contrast, the benefits of domain-specific fine-tuning are marginal and largely neutralized when external material is provided. Furthermore, stress tests show that all models, regardless of architecture, fail to identify and reject negated evidence, highlighting a critical vulnerability in current RAG systems. Notably, even under ideal RAG conditions, current LLMs achieve only slightly above-average performance (~2.7/5.0). MedMeta provides a challenging new benchmark for evidence synthesis and demonstrates that for clinical applications, developing robust RAG systems is a more promising direction than model specialization alone.
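
The two agreement statistics named in the abstract are standard and easy to compute. The sketch below uses the textbook definitions (Pearson's $r$, and Bland-Altman bias with 95% limits of agreement at bias ± 1.96 standard deviations of the paired differences); the function names and any sample ratings are illustrative and are not MedMeta data.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bland_altman(x, y):
    """Return the mean difference (bias) and the 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, (bias - spread, bias + spread)
```

A high `pearson_r` alone does not rule out a constant offset between judge and human scores, which is why the bias from `bland_altman` is reported alongside it.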

2605.09659 2026-05-12 cs.RO

ASACK: Adaptive Safe Active Continual Koopman Learning for Uncertain Systems with Contractive Guarantees

Chandan Kumar Sah, Rajpal Singh, Jishnu Keshavan

Affiliations * Department of Mechanical Engineering, Indian Institute of Science

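AI summary This paper proposes ASACK, an adaptive safe active continual Koopman learning framework for safe control of uncertain systems under model uncertainty and distributional shift. An autoencoder-based Koopman model is learned offline and corrected online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shift and model uncertainty. To improve data efficiency, the method adds an active learning strategy that steers the system toward informative data while it accomplishes task objectives, combines the active learning objectives and safety constraints in a nonconvex optimization problem, and obtains formal safety guarantees through a robust MPC framework. Experiments show that it outperforms existing state-of-the-art methods.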

English abstract

Koopman operator theory provides a powerful framework for representing nonlinear dynamics through a linear operator acting on lifted observables, enabling the use of linear control techniques for nonlinear systems. However, Koopman models are typically learned from data and often degrade in performance under model uncertainty and distributional shifts between training and deployment. Although several works have explored online adaptation to address this issue, many rely on neural network-based updates that introduce significant computational overhead and lack formal safety guarantees, limiting their suitability for real-time and safety-critical robotic applications. In this work, we propose a unified framework for continual adaptive Koopman learning that enables safe and efficient online refinement of learned models during task execution. An autoencoder-based Koopman model is first learned offline and subsequently refined online through a contractive adaptation law, which provides theoretical convergence guarantees under distributional shifts and model uncertainty. To improve data efficiency and accelerate model refinement, the adaptation mechanism is integrated with an active learning strategy that drives the system to collect informative data while accomplishing task objectives. The resulting control problem is formulated as a nonconvex optimization problem incorporating both active learning objectives and safety constraints. We further derive theoretical bounds on model approximation error and show how these bounds can be incorporated within a robust Model Predictive Control (MPC) framework to provide formal safety guarantees. The proposed approach unifies learning, excitation, and safety within a single control framework without sacrificing real-time feasibility. Extensive simulation and experimental studies demonstrate superior performance compared to state-of-the-art baselines.
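
The phrase "a linear operator acting on lifted observables" can be made concrete with a least-squares fit in the style of extended dynamic mode decomposition (EDMD). This is a swapped-in textbook technique with a hand-picked dictionary, not the paper's autoencoder or its contractive adaptation law; the `lift` dictionary and function names are assumptions.

```python
import numpy as np

def lift(x):
    """Hand-picked observables [x1, x2, x1^2] (a hypothetical dictionary)."""
    x1, x2 = x[..., 0], x[..., 1]
    return np.stack([x1, x2, x1 ** 2], axis=-1)

def fit_koopman(X, Y):
    """Least-squares matrix K such that lift(Y) is approximately lift(X) @ K.T."""
    PX, PY = lift(X), lift(Y)
    K_T, *_ = np.linalg.lstsq(PX, PY, rcond=None)
    return K_T.T

def predict(K, x):
    """One-step prediction: lift, apply K, read back the first two observables."""
    return (K @ lift(x))[:2]
```

For the nonlinear map $x_1' = a x_1$, $x_2' = b x_2 + c x_1^2$, this dictionary is closed under the dynamics, so the fitted `K` is exact and linear control tools can be applied in the lifted coordinates.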

2605.09656 2026-05-12 cs.RO

ORICF -- Open Robotics Inference and Control Framework

Andrés Meseguer Valenzuela, Luís Miguel Bartolín Arnau

Affiliations * Instituto Tecnológico de Informática (ITI)

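AI summary This paper proposes ORICF, the Open Robotics Inference and Control Framework, which targets the high computational overhead, latency, and energy use of current AI in robotics applications. The framework is modular, declarative, and model-agnostic, and lightweight YAML configuration allows models, hardware, and data channels to be changed without modifying code. In experiments on a mobile robot that combines speech recognition, a large language model, and an object detection model, edge deployment with ORICF substantially reduces robot-side compute load and energy consumption while preserving modularity and reproducibility.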

Comments Accepted in ICRA26 Workshop: 8th International Workshop on Robotics Software Engineering (RoSE 26)

English abstract

Recent advances in artificial intelligence (AI) have enabled effective perception and language models for robots, but their deployment remains computationally expensive, increasing latency and energy use. This work presents the Open Robotics Inference and Control Framework (ORICF), a modular, declarative, and model-agnostic platform for composing multimodal robotic inference pipelines. ORICF integrates input/output (I/O) adapters, pluggable inference back ends, and post-processing logic, while lightweight YAML specifications allow models, hardware targets, and data channels to be changed without code modification. The framework also supports edge offloading, i.e., executing inference on nearby external computers instead of onboard the robot. ORICF is evaluated on a mobile robot that answers spoken queries about people detected in its camera stream by combining automatic speech recognition (ASR), a large language model (LLM), and a convolutional neural network (CNN) detector through Robot Operating System 2 (ROS2). Compared with onboard execution, ORICF-based edge deployment reduces robot-side compute utilization by up to 83.16% and estimated energy consumption by 65.8%, while preserving modularity and reproducibility.

2605.09650 2026-05-12 cs.AI cs.LG

Workspace Optimization: How to Train Your Agent

Elad Sarafian, Gal Kaplun, Ron Banner, Daniel Soudry, Boris Ginsburg

Affiliations * NVIDIA

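AI summary This paper studies how to improve an agent's performance on hard multi-turn tasks by optimizing its "workspace". The authors argue that when the weights of a frontier language model cannot be adapted, training should instead act on a structured external workspace, a process they call workspace optimization. They instantiate the idea in DreamTeam, a multi-agent system that builds an executable world model, and on ARC-AGI-3 it scores higher than the existing best method while using fewer environment actions.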

English abstract

Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's workspace, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

2605.09649 2026-05-12 cs.LG

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Ngoc Bui, Hieu Trung Nguyen, Arman Cohan, Rex Ying

Affiliations * Department of Computer Science, Yale University; The Chinese University of Hong Kong

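AI summary This paper studies how better management of the key-value (KV) cache can improve long-context inference. The authors propose a global retention-based KV cache eviction method that learns each token's future utility and evicts selectively under a unified memory budget, reducing memory use while improving generation quality. Experiments on multiple long-context language and vision-language reasoning tasks show that the method lowers KV memory usage while matching or surpassing full-cache inference.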

Comments A learnable KV eviction method for large language models

English abstract

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities compete directly for cache capacity. We further provide theoretical analysis showing that preferentially retaining useful tokens reduces attention dilution, and we justify geometric retention as a query-agnostic proxy for future utility. Across diverse long-context language reasoning, vision-language reasoning, and multi-turn dialogue benchmarks, our method substantially reduces KV memory while matching or surpassing full-cache inference. These results suggest that learned, globally calibrated KV eviction is not only a compression technique, but also a mechanism for improving long-context reasoning.
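
The "single global eviction policy" amounts to one top-k selection over every cached entry instead of a fixed quota per layer or head. The sketch below assumes the utility scores are already calibrated and comparable across layers and heads; how the retention gates produce them is not shown, and `global_keep_mask` is a hypothetical name.

```python
import numpy as np

def global_keep_mask(scores, budget):
    """Keep the `budget` highest-scoring KV entries across all layers and heads.

    scores: array of shape (layers, heads, tokens) with calibrated utilities.
    Returns a boolean mask of the same shape; True marks retained entries.
    """
    flat = scores.ravel()
    budget = max(0, min(int(budget), flat.size))
    mask = np.zeros(flat.size, dtype=bool)
    if budget > 0:
        kth = flat.size - budget
        # argpartition places the `budget` largest scores at positions kth onward.
        keep = np.argpartition(flat, kth)[kth:]
        mask[keep] = True
    return mask.reshape(scores.shape)
```

Because all entries compete for one budget, `mask.sum(axis=-1)` can differ from head to head: heads holding more useful tokens keep more of their cache.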

2605.09640 2026-05-12 cs.CV cs.LG

Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu

Affiliations * The University of Hong Kong; The Hong Kong University of Science and Technology; Hong Kong Generative AI Research and Development Center

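AI summary This paper studies how to overcome catastrophic forgetting in visual continual learning and proposes RaPO, a new method based on reinforcement fine-tuning. The authors find that existing methods such as GRPO still forget substantially under class-incremental and domain-incremental learning, and trace the cause to trajectory-level policy drift. RaPO mitigates this forgetting by introducing a retention reward and cross-task advantage normalization; experiments show strong performance across several continual learning settings, and the work offers a systematic exploration of reinforcement fine-tuning for visual continual learning.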

English abstract

Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
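
The two RaPO components can be sketched roughly as follows. The abstract does not give the formulas, so the retention bonus `lam * exp(-kl)`, the momentum value, and the class and function names are assumptions meant only to show the shape of the idea: rollouts with equal task reward are separated by their drift, and reward statistics persist across task boundaries.

```python
import math

def shaped_rewards(task_rewards, kl_drifts, lam=0.5):
    """Add a retention bonus that shrinks as KL drift from the old policy grows."""
    return [r + lam * math.exp(-kl) for r, kl in zip(task_rewards, kl_drifts)]

class CrossTaskNormalizer:
    """EMA of reward mean and variance that is not reset at task boundaries."""

    def __init__(self, momentum=0.9, eps=1e-6):
        self.momentum, self.eps = momentum, eps
        self.mean, self.var = None, None

    def advantages(self, rewards):
        """Update the running statistics, then normalize this group's rewards."""
        m = sum(rewards) / len(rewards)
        v = sum((r - m) ** 2 for r in rewards) / len(rewards)
        if self.mean is None:
            self.mean, self.var = m, v
        else:
            self.mean = self.momentum * self.mean + (1 - self.momentum) * m
            self.var = self.momentum * self.var + (1 - self.momentum) * v
        std = math.sqrt(self.var + self.eps)
        return [(r - self.mean) / std for r in rewards]
```

Within a group of rollouts that all solve the task, the one that stays closest to the preceding-task policy receives the largest advantage.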