arXivDaily: daily arXiv digest, updated Monday through Friday
All subject categories: 2328
2510.24570 2026-05-13 cs.CL

BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation

Raphaël Bagat, Irina Illina, Emmanuel Vincent

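Affiliations: Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France

AI summary: This paper proposes BEARD, a new framework for domain adaptation of the Whisper speech recognition model in low-resource settings where labeled data is scarce. The method combines the BEST-RQ self-supervised objective with knowledge distillation, adapting the Whisper encoder on unlabeled data while keeping it complementary to the pre-trained decoder. In the air traffic control communications domain, which features non-native speech, noise, and specialized phraseology, the method uses only about 5,000 hours of untranscribed speech and 2 hours of transcribed speech and outperforms the previous baseline and the fine-tuned model, with a 12% relative improvement over the latter. The authors describe it as the first work to apply self-supervised learning to Whisper domain adaptation.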

Comments Accepted to ICASSP 2026

English abstract

Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper's encoder with unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder's complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms the previous baseline and the fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
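As a rough illustration of the BEST-RQ side of this abstract, the sketch below generates discrete training targets with a frozen random-projection quantizer, which is how BEST-RQ is commonly described: each feature frame is projected by a fixed random matrix and labeled with the index of the nearest entry in a fixed random codebook. The dimensions, the Gaussian initialization, and all function names are assumptions for illustration; the abstract does not give BEARD's actual configuration, and the distillation term is not shown.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_quantizer(feat_dim, proj_dim, codebook_size, seed=0):
    """Draw a random projection and codebook once; neither is ever trained."""
    rng = random.Random(seed)
    projection = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
                  for _ in range(proj_dim)]
    codebook = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(proj_dim)])
                for _ in range(codebook_size)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Return the index of the codebook entry nearest to the projected frame."""
    projected = l2_normalize([sum(w * x for w, x in zip(row, frame))
                              for row in projection])
    # For unit vectors, the nearest neighbour has the largest dot product.
    scores = [sum(c * p for c, p in zip(code, projected)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


# Hypothetical toy input: two 4-dimensional feature frames.
projection, codebook = make_quantizer(feat_dim=4, proj_dim=3, codebook_size=8)
frames = [[0.1, -0.3, 0.7, 0.2], [1.5, 0.4, -0.9, 0.0]]
targets = [quantize(f, projection, codebook) for f in frames]
```

In a masked-prediction setup, the encoder would be trained to predict `targets` at masked positions; because the quantizer is frozen, the labels for a given frame never change during training.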

2510.06371 2026-05-13 cs.CL cs.AI

OASIS: A Multilingual and Multimodal Dataset for Culturally Grounded Spoken Visual QA

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

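Affiliations: Qatar Computing Research Institute

AI summary: OASIS is a large-scale multilingual, multimodal dataset designed to support culturally grounded spoken visual question answering. It contains a large volume of image, text, and speech data covering English and multiple Arabic varieties, and is suited to evaluating models on commonsense reasoning, cultural understanding, and real-world scenarios. The work presents EverydayMMQA, a scalable semi-automatic framework for building localized QA resources, and uses multi-stage human validation to ensure data quality, supporting the training and evaluation of multimodal models.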

Comments Multimodal Foundation Models, Large Language Models, Native, Multilingual, Language Diversity, Contextual Understanding, Culturally Informed

English abstract

Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they are often limited when queries require cultural and visual information and everyday knowledge, particularly in low-resource and underrepresented languages. We introduce OASIS, a large-scale culturally grounded multimodal QA dataset covering images, text, and speech. OASIS is built with EverydayMMQA, a scalable semi-automatic framework for creating localized spoken and visual QA resources, supported by multi-stage human-in-the-loop validation. OASIS contains approximately 0.92M real images and 14.8M QA pairs, including 3.7M spoken questions, with 383 hours of human-recorded speech and 20K hours of voice-cloned speech from 42 speakers. It supports four input settings: text-only, speech-only, text+image, and speech+image. The dataset focuses on English and Arabic varieties across 18 countries, covering Modern Standard Arabic (MSA) as well as dialectal Arabic. It is designed to evaluate models beyond object recognition, targeting pragmatic, commonsense, and culturally grounded reasoning in real-world scenarios. We benchmark four closed-source models, three open-source models, and one fine-tuned model on OASIS. The framework and dataset will be made publicly available to the community. https://huggingface.co/datasets/QCRI/OASIS

2510.05408 2026-05-13 cs.CV cs.AI

See the past: Time-Reversed Scene Reconstruction from Thermal Traces Using Visual Language Models

Kebin Contreras, Luis Toscano-Palomino, Mauro Dalla Mura, Jorge Bacca

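Affiliations: Physics School, Universidad Industrial de Santander, Colombia; Department of Computer Science, Universidad Industrial de Santander, Colombia; GIPSA-Lab, Université Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France; Institut Universitaire de France (IUF), France

AI summary: This study proposes a time-reversed reconstruction method based on thermal imaging and visual language models, aiming to recover the scene state of a few seconds earlier from current thermal traces. The method couples visual language models with a constrained diffusion process, generating scene descriptions and guiding image reconstruction to keep semantics and structure consistent. Experiments show that the method can reconstruct plausible scene frames from up to 120 seconds earlier in controlled settings, offering a first realization of time-reversed imaging from thermal traces.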

English abstract

Recovering the past from present observations is an intriguing challenge with potential applications in forensics and scene analysis. Thermal imaging, operating in the infrared range, provides access to otherwise invisible information. Since humans are typically warmer (37 °C / 98.6 °F) than their surroundings, interactions such as sitting, touching, or leaning leave residual heat traces. These fading imprints serve as passive temporal codes, allowing for the inference of recent events that exceed the capabilities of RGB cameras. This work proposes a time-reversed reconstruction framework that uses paired RGB and thermal images to recover scene states from a few seconds earlier. The proposed approach couples Visual-Language Models (VLMs) with a constrained diffusion process, where one VLM generates scene descriptions and another guides image reconstruction, ensuring semantic and structural consistency. The method is evaluated in three controlled scenarios, demonstrating the feasibility of reconstructing plausible past frames up to 120 seconds earlier, providing a first step toward time-reversed imaging from thermal traces.

2510.04265 2026-05-13 cs.AI cs.CL math.ST stat.ML stat.TH

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

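Affiliations: Department of Computer and Data Sciences, Case Western Reserve University; Department of Physics, Case Western Reserve University

AI summary: This paper proposes a Bayesian framework for evaluating large language models, addressing the unstable and potentially misleading rankings that the conventional Pass@k metric produces when the number of samples is limited. The method estimates each model's underlying success probability and its credible interval, yielding more stable and statistically meaningful rankings, and supports flexible weighting of scoring rubrics. Experiments show that the framework converges faster and ranks more stably than Pass@k, clearly separates statistically significant differences from noise, and applies to both binary and non-binary evaluation.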

Comments OpenReview (ICLR 2026): https://openreview.net/forum?id=PTXi3Ef4sT

Journal ref The Fourteenth International Conference on Learning Representations (ICLR), 2026

English abstract

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
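The closed-form posterior mentioned in this abstract can be sketched with the standard Dirichlet-categorical conjugate update. The snippet below is a minimal illustration, not the authors' scorio code: the function names are hypothetical, the uniform prior is an assumed default, and the interval uses a normal approximation chosen here for brevity rather than an exact credible interval.

```python
import math


def rubric_posterior(counts, weights, prior=None):
    """Posterior mean and variance of a weighted rubric score.

    counts[c]  : number of trials that landed in outcome category c
    weights[c] : rubric score assigned to category c
    prior[c]   : Dirichlet concentration (uniform prior of 1.0 if omitted)
    """
    if prior is None:
        prior = [1.0] * len(counts)
    alpha = [a + n for a, n in zip(prior, counts)]      # conjugate update
    total = sum(alpha)
    probs = [a / total for a in alpha]                  # posterior mean of each category
    mean = sum(w * p for w, p in zip(weights, probs))
    second = sum(w * w * p for w, p in zip(weights, probs))
    variance = (second - mean * mean) / (total + 1.0)   # Dirichlet linear-form variance
    return mean, variance


def approx_interval(mean, variance, z=1.96):
    """Normal approximation to a 95% interval (an assumption of this sketch)."""
    half = z * math.sqrt(variance)
    return mean - half, mean + half


# Binary case: 7 successes and 3 failures out of 10 trials.
mean, variance = rubric_posterior(counts=[3, 7], weights=[0.0, 1.0])
low, high = approx_interval(mean, variance)
```

With binary weights `[0.0, 1.0]` this reduces to the Beta posterior for a success rate; a graded rubric only changes `weights`, for example `[0.0, 0.5, 1.0]` for wrong, partial, and correct.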

2510.02043 2026-05-13 cs.CV cs.HC cs.LG

Zero-shot Human Pose Estimation using Diffusion-based Inverse solvers

Sahil Bhandary Karnoor, Romit Roy Choudhury

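Affiliations: University of Illinois at Urbana-Champaign

AI summary: This paper studies zero-shot human pose estimation with a limited number of sensors. The authors formulate pose estimation as an inverse problem and propose a diffusion-based inverse solver that conditions generation on rotational measurements alone, guided by a likelihood term derived from location measurements. The method requires no per-user fine-tuning and generalizes zero-shot across users, offering a new approach to pose estimation in sparse-sensor settings.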

Comments Published as a Conference Paper at The Fourteenth International Conference on Learning Representations

English abstract

Pose estimation refers to tracking a human's full body posture, including their head, torso, arms, and legs. The problem is challenging in practical settings where the number of body sensors is limited. Past work has shown promising results using conditional diffusion models, where the pose prediction is conditioned on both <location, rotation> measurements from the sensors. Unfortunately, nearly all these approaches generalize poorly across users, primarily because location measurements are highly influenced by the body size of the user. In this paper, we formulate pose estimation as an inverse problem and design an algorithm capable of zero-shot generalization. Our idea utilizes a pre-trained diffusion model and conditions it on rotational measurements alone; the priors from this model are then guided by a likelihood term, derived from the measured locations. Thus, given any user, our proposed InPose method generatively estimates the highly likely sequence of poses that best explains the sparse on-body measurements.

2509.19207 2026-05-13 cs.CV

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in Contrastive VLMs

Israfel Salazar, Desmond Elliott, Yova Kementchedjhieva

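Affiliations: Department of Computer Science, University of Copenhagen; MBZUAI

AI summary: This paper studies the challenges contrastive vision-language models (VLMs) face in understanding long, compositional captions, and analyzes the relationship between compositional reasoning and long-caption understanding. Through controlled experiments across training objectives, datasets, and architectural designs, it finds a bidirectional but sensitive relationship between the two: high-quality long-caption data with strong visual grounding improves both capabilities at once, while some architectural choices can limit compositional learning. The study offers practical guidance on data selection and model design for improving VLM generalization.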

Comments To be published in Findings of ACL 2026

English abstract

Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, yet understanding long, compositional captions remains an open challenge. While these capabilities are often assumed to be closely related, the conditions under which they reinforce each other remain unclear. In this paper, we empirically analyze when compositional reasoning and long-caption understanding transfer across tasks, and when this relationship fails. Through controlled experiments across diverse training objectives, datasets, and architectural designs, we find a bidirectional but sensitive relationship between the two capabilities. Models trained on poorly grounded captions or with limited parameter updates fail to generalize, while high-quality long-caption data with strong visual grounding promotes both capabilities simultaneously. We further show that architectural choices aimed at preserving general alignment, such as frozen positional embeddings, can inadvertently limit compositional learning. Our analysis provides actionable guidelines for data selection and model design to improve VLM generalization.

2509.14933 2026-05-13 cs.LG

DAG: A Dual Correlation Network for Time Series Forecasting with Exogenous Variables

Xiangfei Qiu, Yuhan Zhu, Zhengyu Li, Xingjian Wu, Bin Yang, Jilin Hu

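Affiliations: School of Data Science and Engineering, East China Normal University, Shanghai, China

AI summary: Time series forecasting matters in many domains, and exogenous variables (covariates) provide additional predictive information that improves accuracy. However, existing methods fall short in exploiting exogenous variables, especially future exogenous variables and their correlation with endogenous variables. This paper proposes DAG, which builds a dual correlation network along the temporal and channel dimensions to capture the correlations of historical exogenous variables with future exogenous variables and with historical endogenous variables, and injects them into the forecasting of future endogenous variables to improve accuracy.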

Comments Accepted by ICML 2026

English abstract

Time series forecasting is essential in various domains. Compared to relying solely on endogenous variables (i.e., target variables), considering exogenous variables (i.e., covariates) provides additional predictive information and often leads to more accurate predictions. However, existing methods for time series forecasting with exogenous variables (TSF-X) have the following shortcomings: 1) they do not leverage future exogenous variables, and 2) they fail to fully account for the correlation between endogenous and exogenous variables. In this study, to better leverage exogenous variables, especially future exogenous variables, we propose DAG, which utilizes a Dual correlAtion network along both the temporal and channel dimensions for time series forecasting with exoGenous variables. Specifically, we propose two core components: the Temporal Correlation Module and the Channel Correlation Module. Both modules consist of a correlation discovery submodule and a correlation injection submodule. The former is designed to capture the correlation effects of historical exogenous variables on future exogenous variables and on historical endogenous variables, respectively. The latter injects the discovered correlation relationships into the processes of forecasting future endogenous variables based on historical endogenous variables and future exogenous variables.

2509.10692 2026-05-13 cs.RO

STL-Based Motion Planning and Uncertainty-Aware Risk Analysis for Human-Robot Collaboration with a Multi-Rotor Aerial Vehicle

Giuseppe Silano, Amr Afifi, Martin Saska, Antonio Franchi

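Affiliations: Department of Cybernetics, Czech Technical University in Prague; Department of Power Generation Technologies and Materials, Ricerca sul Sistema Energetico S.p.A.; Robotics and Mechatronics Department, Electrical Engineering, Mathematics, and Computer Science Faculty, University of Twente; Department of Computer, Control and Management Engineering, Sapienza University of Rome

AI summary: This paper proposes a motion planning and risk analysis framework based on Signal Temporal Logic (STL) to improve collaboration between multi-rotor aerial vehicles and humans. STL encodes key mission objectives such as safety, temporal requirements, and human comfort; an optimization-based planner generates feasible trajectories that respect the vehicle's dynamic constraints; and an uncertainty-aware risk analysis handles uncertainty in human pose. Validation shows that the framework achieves safe, efficient, and robust human-robot collaboration under realistic operating conditions.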

Comments 46 pages, 14 figures

Journal ref Journal of Intelligent & Robotic Systems, 2026

English abstract

This paper presents a motion planning and risk analysis framework for enhancing human-robot collaboration with a Multi-Rotor Aerial Vehicle. The proposed method employs Signal Temporal Logic to encode key mission objectives, including safety, temporal requirements, and human preferences, with particular emphasis on ergonomics and comfort. An optimization-based planner generates dynamically feasible trajectories while explicitly accounting for the vehicle's nonlinear dynamics and actuation constraints. To address the resulting non-convex and non-smooth optimization problem, smooth robustness approximations and gradient-based techniques are adopted. In addition, an uncertainty-aware risk analysis is introduced to quantify the likelihood of specification violations under human-pose uncertainty. A robustness-aware event-triggered replanning strategy further enables online recovery from disturbances and unforeseen events by preserving safety margins during execution. The framework is validated through MATLAB and Gazebo simulations on an object handover task inspired by power line maintenance scenarios. Results demonstrate the ability of the proposed method to achieve safe, efficient, and resilient human-robot collaboration under realistic operating conditions.
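The abstract mentions smooth robustness approximations without naming one. A common choice, assumed here purely for illustration, is the log-sum-exp soft minimum, which turns the non-smooth robustness of an "always" specification into a differentiable quantity suitable for gradient-based planners. The function names, the sharpness parameter, and the distance values are hypothetical.

```python
import math


def smooth_min(values, sharpness=10.0):
    """Differentiable under-approximation of min(values) via log-sum-exp."""
    lowest = min(values)
    # Subtracting the true minimum keeps every exponent <= 0 for numerical stability.
    total = sum(math.exp(-sharpness * (v - lowest)) for v in values)
    return lowest - math.log(total) / sharpness


def always_above(signal, threshold, sharpness=10.0):
    """Smooth robustness of 'signal stays above threshold at every time step'."""
    return smooth_min([s - threshold for s in signal], sharpness)


# Hypothetical human-to-vehicle distances (metres) along a trajectory.
distances = [1.0, 0.5, 2.0]
rho = always_above(distances, threshold=0.3, sharpness=50.0)
```

A positive `rho` indicates the safety margin is respected at every step. The approximation never exceeds the true minimum and is within `log(n) / sharpness` of it, so a larger sharpness tightens the bound at the cost of steeper gradients.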

2509.09838 2026-05-13 cs.LG cs.AI

Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives

Reza Asad, Reza Babanezhad, Sharan Vaswani

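Affiliations: Simon Fraser University; Samsung AI

AI summary: This paper studies the limitations of discrete Soft Actor-Critic (DSAC) in discrete action spaces and proposes principled alternatives. The authors find that the main cause of DSAC's poor performance is the coupling between the actor and critic entropy, and that decoupling the two improves performance significantly. Building on this, they introduce a flexible off-policy actor-critic framework that supports new objectives, prove convergence guarantees, and show on Atari games that the objectives remain robust even without entropy regularization or explicit exploration mechanisms.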

English abstract

While Soft Actor-Critic (SAC) is highly effective in continuous control, its discrete counterpart (DSAC) performs poorly on challenging discrete-action domains such as Atari. Consequently, starting from DSAC, we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that by merely decoupling these components, DSAC's performance significantly improves. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case and yields novel objectives. Our framework allows using an m-step Bellman operator for the critic update, and instantiates the actor objective by combining standard policy optimization methods with entropy regularization. Theoretically, we prove that the proposed methods can guarantee convergence to the optimal regularized value function in the tabular setting, generalizing the results in prior work. Empirically, we evaluate the proposed objectives on standard Atari games. Our ablations indicate that, unlike DSAC, these objectives, including novel ones, perform robustly even without entropy regularization or explicit exploration mechanisms.

2509.06701 2026-05-13 cs.LG cs.AI

Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks

Su Hyeong Lee, Risi Kondor, Richard Ngo

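Affiliations: Department of Statistics, University of Chicago; Department of Computer Science, University of Chicago

AI summary: This paper proposes a theory of intelligent agency based on probabilistic modeling for understanding latent agentic substructures in deep neural networks. Agents are defined by outcome distributions with epistemic utility, and weighted logarithmic pooling is used to study how agent compositions form; the authors prove that strict unanimity is achievable under specific conditions. The study also formalizes an agentic alignment phenomenon in large language models, in which eliciting a benevolent agent can induce an antagonistic one, providing a new mathematical framework and insights for aligning agentic AI systems.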

Comments Accepted by ICML 2026

English abstract

We develop a theory of intelligent agency grounded in probabilistic modeling for neural models. Agents are represented as outcome distributions with epistemic utility given by log score, and compositions are defined through weighted logarithmic pooling that strictly improves every member's welfare. We prove that strict unanimity is impossible under linear pooling or in binary outcome spaces, but possible with three or more outcomes. Our framework admits recursive structure via cloning invariance, continuity, and openness, while tilt-based analysis rules out trivial duplication. Finally, we formalize an agentic alignment phenomenon in LLMs using our theory: eliciting a benevolent persona ("Luigi") induces an antagonistic counterpart ("Waluigi"), while a manifest-then-suppress Waluigi strategy yields strictly larger first-order misalignment reduction than pure Luigi reinforcement alone. These results clarify how developing a principled mathematical framework for how subagents can coalesce into coherent higher-level entities provides novel implications for alignment in agentic AI systems.
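Weighted logarithmic pooling, as named in this abstract, has a standard definition: the pooled probability of each outcome is proportional to the weighted geometric mean of the members' probabilities. The sketch below shows only that definition and the log score; it does not reproduce the paper's welfare or unanimity results, and the function names and example distributions are hypothetical.

```python
import math


def log_pool(distributions, weights):
    """Weighted logarithmic pool: p(o) proportional to prod_i p_i(o) ** w_i.

    Every probability must be strictly positive, since logs are taken.
    """
    n_outcomes = len(distributions[0])
    log_scores = [
        sum(w * math.log(dist[o]) for dist, w in zip(distributions, weights))
        for o in range(n_outcomes)
    ]
    peak = max(log_scores)                      # shift before exponentiating for stability
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def log_score(distribution, outcome):
    """Epistemic utility of a distribution once `outcome` is observed."""
    return math.log(distribution[outcome])


# Two hypothetical subagents over three outcomes, pooled with equal weights.
agent_a = [0.6, 0.3, 0.1]
agent_b = [0.2, 0.5, 0.3]
pooled = log_pool([agent_a, agent_b], weights=[0.5, 0.5])
```

Comparing `log_score(pooled, o)` against each member's own log score, averaged under that member's beliefs, is one way to explore the welfare comparison the abstract refers to.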

2508.21260 2026-05-13 cs.RO eess.SP math.ST stat.TH

Remarks on stochastic cloning and delayed-state filtering

Tara Mina, Lindsey Marinello, John Christian

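Affiliations: Georgia Institute of Technology

AI summary: This paper studies estimation problems in aerospace navigation and robotics that involve delayed-state measurements depending on prior states, focusing on stochastic cloning (SC) and a long-overlooked alternative, the delayed-state Kalman filter (DSKF). It finds that a properly derived DSKF yields the same state and covariance update as SC without state augmentation, and presents two equivalent DSKF formulations that explain from different angles how prior-state measurement correlations are handled within the generalized Kalman filter. The DSKF is comparable to SC in computational and memory complexity and can further reduce arithmetic and storage costs for certain problem dimensions, which corrects the misconception that Kalman filter variants cannot handle correlated delayed-state measurements.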

English abstract

Many estimation problems in aerospace navigation and robotics involve measurements that depend on prior states. A prominent example is odometry, which measures the relative change between states over time. Accurately handling these delayed-state measurements requires capturing their correlations with prior state estimates, and a widely used approach is stochastic cloning (SC), which augments the state vector to account for these correlations. This work revisits a long-established but often overlooked alternative--the delayed-state Kalman filter--and demonstrates that a properly derived filter yields exactly the same state and covariance update as SC, without requiring state augmentation. Moreover, two equivalent formulations of the delayed-state Kalman filter (DSKF) are presented, providing complementary perspectives on how the prior-state measurement correlations can be handled within the generalized Kalman filter. These formulations are shown to be comparable to SC in asymptotic computational and memory complexity, while one DSKF formulation can offer reduced arithmetic and storage costs for certain problem dimensions. Our findings clarify a common misconception that Kalman filter variants are inherently unable to handle correlated delayed-state measurements, demonstrating that an alternative formulation achieves the same results without state augmentation.

2508.20206 2026-05-13 cs.LG cs.AI

Filter then Attend: Improving attention-based Time Series Forecasting with Spectral Filtering

Elisha Dayag, Nhat Thanh Van Tran, Jack Xin

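Affiliations: Department of Mathematics, University of California, Irvine, Irvine, CA 92617

AI summary: This paper studies how spectral filtering can improve Transformer-based long time-series forecasting models. The authors add learnable frequency filters at the model input to improve how the model uses different frequency components. Experiments show that the approach improves forecasting performance on multiple datasets and allows a smaller embedding dimension, making the models smaller and more efficient.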

English abstract

Transformer-based models are at the forefront of long time-series forecasting (LTSF). While in many cases these models achieve state-of-the-art results, they suffer from a bias toward low frequencies in the data and from high computational and memory requirements. Recent work has established that learnable frequency filters can be an integral part of a deep forecasting model by enhancing the model's spectral utilization. These works choose to use a multilayer perceptron to process their filtered signals and thus do not solve the issues found with transformer-based models. In this paper, we establish that adding a filter to the beginning of transformer-based models enhances their performance in long time-series forecasting. We add learnable filters, which add only $\approx 1000$ parameters, to several transformer-based models and observe in multiple instances a 5-10\% relative improvement in forecasting performance. Additionally, we find that with filters added, we are able to decrease the embedding dimension of our models, resulting in transformer-based architectures that are both smaller and more effective than their non-filtering base models. We also conduct synthetic experiments to analyze how the filters enable Transformer-based models to better utilize the full spectrum for forecasting.
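The filtering step described in this abstract can be sketched as: transform the input window to the frequency domain, scale each frequency bin by a gain, and transform back before the window reaches the transformer. The snippet below uses a plain discrete Fourier transform and fixed real gains as stand-ins for learned filter parameters; the parameterization, the function names, and the toy signal are assumptions for illustration only.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2), fine for a short sketch)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Inverse of `dft`, returning complex time-domain samples."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def spectral_filter(signal, gains):
    """Scale each frequency bin by its gain, then return the real time-domain signal."""
    spectrum = dft(signal)
    filtered = [g * c for g, c in zip(gains, spectrum)]
    return [value.real for value in inverse_dft(filtered)]


# Hypothetical lookback window; in a trained model the gains would be learned.
window = [1.0, 2.0, 3.0, 4.0]
unchanged = spectral_filter(window, gains=[1.0, 1.0, 1.0, 1.0])
mean_only = spectral_filter(window, gains=[1.0, 0.0, 0.0, 0.0])
```

Unit gains leave the window unchanged, and keeping only the zero-frequency bin reduces it to its mean, which makes the two extremes of the filter easy to check by hand.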

2508.16070 2026-05-13 cs.CL

Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

Chongyang Li, Zhiqiang Yuan, Hanbo Bi, Zexi Jia, Jinchao Zhang

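Affiliations: Pattern Recognition Center, WeChat AI, Tencent Inc

AI summary: This paper studies how to make vision language models more practical in walking assistance for blind and low-vision users. To address redundant outputs and the lack of proactive environmental risk assessment in existing models, it proposes WalkVLM-LR, a walking assistance model with less redundancy. The model uses human-preference-based reward functions to optimize output conciseness and accuracy, and adds an environment awareness discriminator to make risk assessment more efficient. Experiments show that it outperforms existing methods in output conciseness and temporal redundancy.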

Comments ICASSP 2026 Best Industry Paper

English abstract

Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistance tasks often produce outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, so that WalkVLM-LR can assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and reduced temporal redundancy.

2508.14780 2026-05-13 cs.LG cs.IT math.IT

Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features

Guillermo Sarasa, Ana Granados, Francisco de Borja Rodríguez

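Affiliations: Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid

AI summary: This paper proposes "context steering", a new method for compression-based embeddings that synthesizes relevant information features to make embeddings better suited to the task. The method actively guides feature generation by analyzing how each object influences the relational context within a clustering framework, producing tailored embeddings that highlight class-distinctive information. Experiments show that it yields robust task-oriented embeddings on a range of heterogeneous datasets and improves classification and clustering performance.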

English abstract

Compression-based dissimilarities (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the data, rather than defined as an input, it often proves difficult to align them with the task at hand, particularly in complex clustering or classification settings. To address this issue, we introduce "context steering", a novel methodology that actively guides the feature-shaping process. Instead of passively accepting the emergent data structure (typically a hierarchy derived from clustering CDs), our approach "steers" the process by systematically analyzing how each object influences the relational context within a clustering framework. This process generates a custom-tailored embedding that isolates and amplifies class-distinctive information. We validate this supervised context-steering strategy using Normalized Compression Distance (NCD) and Relative Compression Distance (NRC) combined with hierarchical clustering, and evaluate the learned embeddings through both classification performance and cluster-quality metrics. Experiments on heterogeneous datasets, ranging from text to real-world audio, show that the proposed approach yields robust task-oriented embeddings from compression dissimilarities, moving from traditional transductive uses of distance matrices to an inductive representation that can be applied to unseen data.
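For readers unfamiliar with the dissimilarities this abstract builds on, the Normalized Compression Distance has a standard formula: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The sketch below uses zlib as the compressor, which is an assumption; the abstract does not say which compressor the authors use, and the context-steering procedure itself is not shown.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical objects: repetitive English text versus pseudo-random bytes.
rng = random.Random(0)
text = b"the quick brown fox jumps over the lazy dog " * 20
noise = bytes(rng.randrange(256) for _ in range(len(text)))

near = ncd(text, text)    # shared redundancy, so the distance is small
far = ncd(text, noise)    # nothing shared, so the distance is close to 1
```

A full pairwise NCD matrix over a dataset is the kind of input that hierarchical clustering would consume in the pipeline the abstract describes.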

2508.10036 2026-05-13 cs.CL cs.AI cs.IR cs.LG

Reflect then Learn: Active Prompting for Information Extraction Guided by Introspective Confusion

Dong Zhao, Yadong Wang, Xiang Chen, Chenxi Wang, Hongliang Dai, Chuanxing Geng, Shengzhong Zhang, Shaoyuan Li, Sheng-Jun Huang

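Affiliations: NUAA-MMMI (Nanjing University of Aeronautics and Astronautics, MMMI)

AI summary: This study proposes APIE, an active prompting framework for guiding large language models in information extraction. Based on the principle of "introspective confusion", the method measures the model's own confusion along two dimensions, format uncertainty and content uncertainty, and uses that score to select the most challenging and informative samples as few-shot exemplars. Experiments show significant gains in extraction accuracy and robustness on multiple benchmark datasets.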

Comments Published at AAAI 2026

English abstract

Large Language Models (LLMs) show remarkable potential for few-shot information extraction (IE), yet their performance is highly sensitive to the choice of in-context examples. Conventional selection strategies often fail to provide informative guidance, as they overlook a key source of model fallibility: confusion stemming not just from semantic content, but also from the generation of well-structured formats required by IE tasks. To address this, we introduce Active Prompting for Information Extraction (APIE), a novel active prompting framework guided by a principle we term introspective confusion. Our method empowers an LLM to assess its own confusion through a dual-component uncertainty metric that uniquely quantifies both Format Uncertainty (difficulty in generating correct syntax) and Content Uncertainty (inconsistency in extracted semantics). By ranking unlabeled data with this comprehensive score, our framework actively selects the most challenging and informative samples to serve as few-shot exemplars. Extensive experiments on four benchmarks show that our approach consistently outperforms strong baselines, yielding significant improvements in both extraction accuracy and robustness. Our work highlights the critical importance of a fine-grained, dual-level view of model uncertainty when it comes to building effective and reliable structured generation systems.
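One way to picture the dual uncertainty score in this abstract is sketched below. Every concrete choice is an assumption made for illustration, since the abstract does not define the metrics: format uncertainty is taken as the fraction of sampled outputs that fail to parse as JSON, content uncertainty as disagreement among the outputs that do parse, and the two are combined by a weighted sum. All names and sample strings are hypothetical.

```python
import json
from collections import Counter


def confusion_score(samples, content_weight=1.0):
    """Combine format and content uncertainty over repeated model outputs."""
    parsed = []
    for text in samples:
        try:
            # Canonicalize so that key order does not count as disagreement.
            parsed.append(json.dumps(json.loads(text), sort_keys=True))
        except json.JSONDecodeError:
            pass
    format_uncertainty = 1.0 - len(parsed) / len(samples)
    if parsed:
        top_count = Counter(parsed).most_common(1)[0][1]
        content_uncertainty = 1.0 - top_count / len(parsed)
    else:
        content_uncertainty = 1.0   # nothing parsed, so agreement is unknown
    return format_uncertainty + content_weight * content_uncertainty


def select_exemplars(pool, k):
    """Return the ids of the k unlabeled inputs with the highest confusion."""
    ranked = sorted(pool, key=lambda item: confusion_score(pool[item]), reverse=True)
    return ranked[:k]


# Hypothetical sampled outputs for two unlabeled sentences.
pool = {
    "easy": ['{"per": "Ann"}'] * 4,
    "hard": ['{"per": "Ann"}', '{"per": "Bob"}', "per: Ann", '{"per": "Ann"'],
}
chosen = select_exemplars(pool, k=1)
```

The selected inputs would then be annotated and used as few-shot exemplars; swapping in a different parser or agreement measure changes only `confusion_score`.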

2508.08420 2026-05-13 cs.LG stat.ML

Regret minimization in Linear Bandits with offline data via extended D-optimal exploration

Sushant Vijayan, Arun Suggala, Karthikeyan Shanmugam, Soumyabrata Pal

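Affiliations: School of Technology and Computer Science, TIFR; Google DeepMind; Adobe Research

AI summary: This paper studies online minimization of cumulative regret in linear bandits when offline data is available. It proposes Offline-Online Phased Elimination (OOPE), which uses an extended D-optimal design in each exploration phase to exploit offline data and substantially reduce online regret. The online regret bound is $\tilde{O}(\sqrt{d_{\mathrm{eff}} T \log(|\mathcal{A}|T)} + d^2)$, where $d_{\mathrm{eff}}$ is the number of poorly explored directions in the offline data and reflects its quality. The paper also gives minimax regret lower bounds that depend on offline data quality, and further improves the $O(d^2)$ term with a Frank-Wolfe approximation.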

Comments Accepted to TMLR, with J2C certification, link: https://openreview.net/forum?id=4WcK8gKgCi

English abstract

We consider the problem of online regret minimization in linear bandits with access to prior observations (offline data) from the underlying bandit model. There are numerous applications where extensive offline data is often available, such as recommendation systems and online advertising. Consequently, this problem has been studied intensively in recent literature. Our algorithm, Offline-Online Phased Elimination (OOPE), effectively incorporates the offline data to substantially reduce the online regret compared to prior work. To leverage offline information prudently, OOPE uses an extended D-optimal design within each exploration phase. OOPE achieves an online regret of $\tilde{O}(\sqrt{d_{\mathrm{eff}} T \log(|\mathcal{A}|T)} + d^2)$, where $d_{\mathrm{eff}}$ ($\leq d$) is the effective problem dimension, which measures the number of poorly explored directions in the offline data and depends on the eigen-spectrum $(\lambda_k)_{k \in [d]}$ of the Gram matrix of the offline data. The eigen-spectrum $(\lambda_k)_{k \in [d]}$ is a quantitative measure of the quality of offline data. If the offline data is poorly explored ($d_{\mathrm{eff}} \approx d$), we recover the established regret bounds for the purely online setting, while when offline data is abundant ($T_{\mathrm{off}} \gg T$) and well-explored ($d_{\mathrm{eff}} = o(1)$), the online regret reduces substantially. Additionally, we provide the first known minimax regret lower bounds in this setting that depend explicitly on the quality of the offline data. These lower bounds establish the optimality of our algorithm in regimes where offline data is either well-explored or poorly explored. Finally, by using a Frank-Wolfe approximation to the extended optimal design, we further improve the $O(d^{2})$ term to $O\left(\frac{d^{2}}{d_{\mathrm{eff}}} \min \{ d_{\mathrm{eff}},1\} \right)$, which can be substantial in high dimensions with moderate quality of offline data ($d_{\mathrm{eff}} = \Omega(1)$).

2508.05269 2026-05-13 cs.CV

B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding

Changho Choi, Youngwoo Shin, Gyojin Han, Dong-Jae Lee, Junmo Kim

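Affiliations: Korea Advanced Institute of Science and Technology

AI summary: This study presents B4DL, a new benchmark for training and evaluating multimodal large language models (MLLMs) on spatio-temporal understanding of 4D LiDAR. To address the limited use of 4D LiDAR data in MLLMs, it designs a scalable data generation pipeline and proposes the first MLLM that directly processes raw 4D LiDAR data and connects it with language understanding, offering a unified solution for spatio-temporal reasoning in dynamic outdoor environments.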

Comments Accepted at ACM MM 2025

English abstract

Understanding dynamic outdoor environments requires capturing complex object interactions and their evolution over time. LiDAR-based 4D point clouds provide precise spatial geometry and rich temporal cues, making them ideal for representing real-world scenes. However, despite their potential, 4D LiDAR remains underexplored in the context of Multimodal Large Language Models (MLLMs) due to the absence of high-quality, modality-specific annotations and the lack of MLLM architectures capable of processing its high-dimensional composition. To address these challenges, we introduce B4DL, a new benchmark specifically designed for training and evaluating MLLMs on 4D LiDAR understanding. In addition, we propose a scalable data generation pipeline and an MLLM model that, for the first time, directly processes raw 4D LiDAR by bridging it with language understanding. Combined with our dataset and benchmark, our model offers a unified solution for spatio-temporal reasoning in dynamic outdoor environments. We provide rendered 4D LiDAR videos, generated dataset, and inference outputs on diverse scenarios at: https://github.com/ccho4702/B4DL

2507.21159 2026-05-13 cs.AI cs.LG cs.MA

MAC: Masked Agent Collaboration Boosts Large Language Model Medical Decision-Making

Zhihao Peng, Liuxin Bao, Yixuan Yuan

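Affiliations * School of Automation, Hangzhou Dianzi University; Department of Electronic Engineering, Chinese University of Hong Kong

AI summary This study proposes MAC, a masked agent collaboration framework intended to improve the performance of large language models in medical decision-making. Through Pareto-optimal agent construction and a cross-consistency maximization mechanism, the method achieves adaptive progressive propagation of collaborative information, improving the accuracy and robustness of medical decisions. The study also introduces a model diversity assessment and an output-consistency filtering strategy to optimize agent collaboration and reduce the impact of semantic inconsistency.

Details
English abstract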

Large language models (LLMs) have proven effective in artificial intelligence, and multi-agent systems (MAS) that coordinate several LLMs hold considerable promise for healthcare. However, the absence of a systematic pipeline for agent construction and the rigidity of static collaboration patterns render current MAS-based models vulnerable to collaboration failures, resulting in substantial performance degradation in medical decision-making scenarios. To this end, we propose a novel Masked Agent Collaboration (MAC) framework that harnesses Pareto-optimal agent construction and cross-consistency maximization mechanisms to achieve adaptive progressive propagation of collaborative information, boosting the medical decision-making capacity. Specifically, we first conduct a Pareto-frontier analysis of the LLM pool over key factors, including model size, inference time, diversity score, and throughput ratio, where an LLM's diversity score is derived from the similarity between pairs of its own outputs. This analysis identifies Pareto-optimal models that balance efficiency and capability, which are then selected as collaborative agents, reflecting the fundamental trade-offs inherent in practical LLM deployment. Afterward, we measure the pairwise similarity between the outputs of the collaborative agents to determine their cross-consistency values, and mask out the agent with the lowest cross-consistency value to eliminate the output that is most likely semantically inconsistent. Finally, the agents collaborate through adaptive progressive propagation, where each agent aggregates the outputs of unmasked agents from the previous layer as its input and generates its output via prompt engineering.
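
The masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say which similarity measure is used, so the token-level Jaccard overlap below, the function names, and the example outputs are all hypothetical stand-ins.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; an assumed stand-in for the paper's similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mask_least_consistent(outputs):
    """Score each agent by its mean similarity to the others and mask the lowest.

    Requires at least two outputs. Returns (masked_index, scores, kept_outputs).
    """
    n = len(outputs)
    scores = []
    for i in range(n):
        sims = [jaccard(outputs[i], outputs[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    masked = min(range(n), key=scores.__getitem__)
    kept = [o for k, o in enumerate(outputs) if k != masked]
    return masked, scores, kept


# Hypothetical agent outputs for one query.
agent_outputs = [
    "diagnosis is pneumonia start antibiotics",
    "likely pneumonia start antibiotics",
    "fracture of the left wrist",
]
masked, scores, kept = mask_least_consistent(agent_outputs)
print(masked, scores)
```

In this toy run the third output shares no tokens with the other two, so it receives the lowest cross-consistency score and is the one masked before the next layer aggregates the remaining outputs.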

2507.13625 2026-05-13 cs.AI

Bridging Dual Knowledge Graphs for Multi-Hop Question Answering in Construction Safety

Yuxin Zhang, Xi Wang, Mo Hu, Zhenyu Zhang

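Affiliations * Department of Construction Science, College of Architecture, Texas A&M University, College Station, USA

AI summary This paper studies multi-hop question answering over complex construction safety regulations to support automated compliance checking. It proposes BifrostRAG, a dual-graph retrieval-augmented generation system that models both linguistic relationships and document structure, and uses a hybrid retrieval mechanism combining graph traversal with semantic vector search to improve large language models' reasoning over the content and structure of regulations. Experiments show that BifrostRAG performs strongly on a multi-hop question dataset, significantly outperforming vector-only and graph-only baselines, and offers a transferable approach to processing complex technical documents.

Comments 22 pages, 13 figures

Journal ref Automation in Construction, Volume 183, March 2026, 106794

Details
English abstract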

Information retrieval and question answering from safety regulations are essential for automated construction compliance checking but are hindered by the linguistic and structural complexity of regulatory text. Many queries are multi-hop, requiring synthesis across interlinked clauses. To address the challenge, this paper introduces BifrostRAG, a dual-graph retrieval-augmented generation (RAG) system that models both linguistic relationships and document structure. The proposed architecture supports a hybrid retrieval mechanism that combines graph traversal with vector-based semantic search, enabling large language models to reason over both the content and the structure of the text. On a multi-hop question dataset, BifrostRAG achieves 92.8% precision, 85.5% recall, and an F1 score of 87.3%. These results significantly outperform vector-only and graph-only RAG baselines, establishing BifrostRAG as a robust knowledge engine for LLM-driven compliance checking. The dual-graph, hybrid retrieval mechanism presented in this paper offers a transferable blueprint for navigating complex technical documents across knowledge-intensive engineering domains.

2507.12002 2026-05-13 cs.LG

Detecting In-Person Conversations in Noisy Real-World Environments with Smartwatch Audio and Motion Sensing

Alice Zhang, Callihan Bertley, Dawei Liang, Edison Thomaz

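Affiliations * The University of Texas at Austin

AI summary This study proposes a new method for detecting face-to-face verbal conversations in real-world environments from smartwatch audio and motion sensing data. Fusing microphone audio with 6-axis inertial sensor data, the authors design and train convolutional and attention-based neural networks, with the motion signals capturing non-verbal conversational dynamics. Experiments show that multimodal fusion significantly improves detection performance, reaching macro F1-scores of 82.0% in the lab and 77.2% in a semi-naturalistic setting, which supports the method's effectiveness in practice.

Comments Accepted to ACM Transactions on Intelligent Systems and Technology

Details
English abstract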

Social interactions play a crucial role in shaping human behavior, relationships, and societies. They encompass various forms of communication, such as verbal conversation, non-verbal gestures, facial expressions, and body language. In this work, we develop a novel computational approach to detect face-to-face verbal conversations, a foundational aspect of human social interactions. We leverage multimodal data captured by a commodity smartwatch, specifically synchronizing microphone audio with 6-axis inertial signals (accelerometer and gyroscope). We design, train, and evaluate convolutional and attention-based neural networks using three different fusion methods to integrate the audio and motion modalities. To validate this framework, we conduct a lab study with 11 participants and a semi-naturalistic study with 24 participants. Our comprehensive evaluation demonstrates that fusing inertial data with audio significantly improves detection performance by capturing non-verbal conversational dynamics. Overall, our framework achieved 82.0$\pm$3.0% macro F1-score when detecting conversations in the lab and 77.2$\pm$1.8% in the semi-naturalistic setting. Lastly, we demonstrate real-time conversation detection by deploying our trained model to a user application running on a commercial smartwatch.
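
For readers unfamiliar with the reported metric, the macro F1-score is the unweighted mean of the per-class F1-scores, so the "conversation" and "no conversation" classes count equally even when one is rarer. The sketch below computes it for made-up labels; the data and the function name are hypothetical and unrelated to the study's results.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical window-level labels: 1 = conversation, 0 = no conversation.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
print(score)
```

Here class 0 has an F1 of 0.8 and class 1 an F1 of 2/3, so the macro average sits between the two rather than being dominated by the majority class.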

2507.06694 2026-05-13 cs.LG cs.SY eess.SP eess.SY

Heterogeneous Graph Neural Networks for Short-term State Forecasting in Power Systems across Domains and Time Scales: A Hydroelectric Power Plant Case Study

Raffael Theiler, Olga Fink

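Affiliations * Intelligent Maintenance and Operations Systems (IMOS)

AI summary This paper studies how heterogeneous graph neural networks can be used for short-term state forecasting in power systems across multiple physical domains and time scales. To address the limitations of conventional graph neural networks on heterogeneous sensor data, the authors propose an approach based on heterogeneous graph attention networks that models sensor relationships both within and across the hydraulic and electrical domains. Experiments show that the method improves on conventional baselines by 35.5% on average in normalized root mean square error, supporting its effectiveness for multi-domain, multi-rate power system state forecasting.

Comments 25 pages, 9 figures

Details
English abstract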

Accurate short-term state forecasting is essential for efficient and stable operation of modern power systems, especially in the context of increasing variability introduced by renewable and distributed energy resources. As these systems evolve rapidly, it becomes increasingly important to reliably predict their states in the short term to ensure operational stability, support control decisions, and enable interpretable monitoring of sensor and machine behavior. Modern power systems often span multiple physical domains - including electrical, mechanical, hydraulic, and thermal - posing significant challenges for modeling and prediction. Graph Neural Networks (GNNs) have emerged as a promising data-driven framework for system state estimation and state forecasting in such settings. By leveraging the topological structure of sensor networks, GNNs can implicitly learn inter-sensor relationships and propagate information across the network. However, most existing GNN-based methods are designed under the assumption of homogeneous sensor relationships and are typically constrained to a single physical domain. This limitation restricts their ability to integrate and reason over heterogeneous sensor data commonly encountered in real-world energy systems, such as those used in energy conversion infrastructure. In this work, we propose the use of Heterogeneous Graph Attention Networks to address these limitations. Our approach models both homogeneous intra-domain and heterogeneous inter-domain relationships among sensor data from two distinct physical domains - hydraulic and electrical - which exhibit fundamentally different temporal dynamics. Experimental results demonstrate that our method significantly outperforms conventional baselines on average by 35.5% in terms of normalized root mean square error, confirming its effectiveness in multi-domain, multi-rate power system state forecasting.

2507.03622 2026-05-13 cs.LG cs.AI stat.ML

Localising Dropout Variance in Twin Networks

Cooper Doyle

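Affiliations * Commonwealth Bank of Australia

AI summary This paper studies how to localise the source of predictive uncertainty in twin-network models and proposes a layer-wise variance decomposition that splits total predictive variance into an encoder component and a head component. Toggling Monte Carlo Dropout independently in the shared encoder and the outcome heads separates the two sources of uncertainty. Experiments show that encoder variance dominates under distributional shift and is the primary predictor of error, while head variance becomes informative only once encoder uncertainty is controlled. The method is cheap and offers practical guidance for data collection.

Comments 14 pages, 5 figures, 3 tables

Details
English abstract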

Accurate individual treatment-effect estimation demands not only reliable point predictions but also uncertainty measures that help practitioners \emph{locate} the source of model failure. We introduce a layer-wise variance decomposition for deep twin-network models: by toggling Monte Carlo Dropout independently in the shared encoder and the outcome heads, we split total predictive variance into an \emph{encoder component} ($\sigma_{\mathrm{enc}}^2$) and a \emph{head component} ($\sigma_{\mathrm{head}}^2$), with $\sigma_{\mathrm{enc}}^2 + \sigma_{\mathrm{head}}^2 \approx \sigma_{\mathrm{tot}}^2$ by the law of total variance. Across three synthetic covariate-shift regimes, the encoder component dominates under distributional shift ($\rho_{\mathrm{enc}}=0.53$) while the head component becomes informative only once encoder uncertainty is controlled. On a real-world twins cohort with induced multivariate shift, only $\sigma_{\mathrm{enc}}^2$ spikes on out-of-distribution samples and becomes the primary error predictor ($\rho_{\mathrm{enc}}\!\approx\!0.89$), while $\sigma_{\mathrm{head}}^2$ remains flat. The decomposition adds negligible cost over standard MC Dropout and provides a practical diagnostic for deciding whether to collect more diverse covariates or more outcome data.
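
The toggling idea can be illustrated on a toy model. The sketch below is not the authors' code: the two-stage linear "network", the keep probability, the weights, and the function names are all hypothetical. It runs Monte Carlo Dropout three times (encoder only, head only, both) and compares the sum of the two component variances with the total.

```python
import random
import statistics

KEEP = 0.9  # dropout keep probability (assumed value for illustration)


def forward(x, rng, drop_encoder, drop_head):
    """Toy forward pass with inverted dropout in the encoder and in the head."""
    enc_w = [0.5, -0.3, 0.8, 0.2]  # hypothetical encoder weights
    h = 0.0
    for w, xi in zip(enc_w, x):
        mask = (rng.random() < KEEP) / KEEP if drop_encoder else 1.0
        h += mask * w * xi
    # Outcome head: a single scaled unit, optionally dropped.
    mask = (rng.random() < KEEP) / KEEP if drop_head else 1.0
    return mask * 1.5 * h


def mc_variance(x, drop_encoder, drop_head, n=20000, seed=0):
    """Predictive variance over n stochastic forward passes."""
    rng = random.Random(seed)
    samples = [forward(x, rng, drop_encoder, drop_head) for _ in range(n)]
    return statistics.pvariance(samples)


x = [1.0, 2.0, -1.0, 0.5]
var_enc = mc_variance(x, drop_encoder=True, drop_head=False)
var_head = mc_variance(x, drop_encoder=False, drop_head=True)
var_tot = mc_variance(x, drop_encoder=True, drop_head=True)
print(var_enc, var_head, var_enc + var_head, var_tot)
```

In this toy setting the two components sum to slightly less than the total, because the head noise multiplies the encoder noise and leaves a small cross term. That is consistent with the approximate, rather than exact, equality stated in the abstract.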

2506.23723 2026-05-13 cs.RO

A comprehensive control architecture for semi-autonomous dual-arm robots in agriculture settings

Jozsef Palmieri, Paolo Di Lillo, Stefano Chiaverini, Alessandro Marino

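Affiliations * Department of Electrical and Information Engineering, University of Cassino and Southern Lazio

AI summary This paper presents a comprehensive control architecture for semi-autonomous dual-arm robots in agricultural settings, aimed at complex tasks such as grape harvesting. The architecture is built on a 16-DOF dual-arm mobile robot and uses Hierarchical Quadratic Programming (HQP) to handle equality and inequality constraints at multiple priorities while harvesting the grape bunches selected by the perception system. To cope with environmental uncertainty and potential collisions, interaction forces are handled within the same HQP framework, which also lets a human operator assist in completing tasks. The architecture is validated through extensive tests in a laboratory and in a real vineyard.

Journal ref Control Engineering Practice, Vol. 163, 2025

Details
English abstract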

The adoption of mobile robotic platforms in complex environments, such as agricultural settings, requires these systems to exhibit a flexible yet effective architecture that integrates perception and control. In such scenarios, several tasks need to be accomplished simultaneously, ranging from managing robot limits to performing operational tasks and handling human inputs. The purpose of this paper is to present a comprehensive control architecture for achieving complex tasks such as robotized harvesting in vineyards within the framework of the European project CANOPIES. In detail, a 16-DOF dual-arm mobile robot is employed, controlled via a Hierarchical Quadratic Programming (HQP) approach capable of handling both equality and inequality constraints at various priorities to harvest grape bunches selected by the perception system developed within the project. Furthermore, given the complexity of the scenario and the uncertainty in the perception system, which could potentially lead to collisions with the environment, the handling of interaction forces is necessary. Remarkably, this was achieved using the same HQP framework. This feature is further leveraged to enable semi-autonomous operations, allowing a human operator to assist the robotic counterpart in completing harvesting tasks. Finally, the obtained results are validated through extensive testing conducted first in a laboratory environment to prove individual functionalities, then in a real vineyard, encompassing both autonomous and semi-autonomous grape harvesting operations.

2506.22809 2026-05-13 cs.LG cs.AI cs.CL

Learning Adapter Rank via Symmetry Breaking

Cooper Doyle, Andy Hu, Rebecca Chan, Anna Leontjeva

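Affiliations * Commonwealth Bank of Australia

AI summary This study addresses the non-identifiability of rank coordinates in low-rank adaptation (LoRA). It uses variational inference with a diagonal posterior to break LoRA's rotational symmetry, which automatically determines the relevance of each rank direction. Building on this, the study proposes BayesLoRA, which performs Bayesian inference directly in the low-rank space and jointly learns the effective adapter rank and predictive uncertainty with only a few additional parameters. Experiments show compact predictive calibration and performance that matches or exceeds existing low-rank sparsification methods at comparable training cost.

Comments 8 pages, 2 figures, 4 tables

Details
English abstract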

Low-rank adaptation is effective partly because downstream updates lie in a low-dimensional subspace, but the latent rank coordinates of LoRA are not identifiable: any invertible reparameterization of the adapter factors leaves the weight update unchanged. We show that variational inference with a diagonal rank-wise posterior turns this non-identifiability into a useful inductive bias. By breaking LoRA's rotational gauge symmetry, the variational objective selects a preferred basis in rank space, enabling automatic relevance determination over rank directions. This yields Low-Rank Variational Dropout (LRVD), a Bayesian framework that performs inference directly in the low-rank adaptation space rather than the ambient weight space. As an instantiation, BayesLoRA jointly learns effective adapter rank and predictive uncertainty with only $\mathcal{O}(r)$ additional parameters. Empirically, BayesLoRA induces stable rank structure aligned with the dominant singular directions of learned updates, yields compact predictive calibration and matches or exceeds strong low-rank sparsification baselines at comparable training cost.
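
The non-identifiability claim in the first sentence can be checked numerically: for adapter factors $B$ and $A$ and any invertible $M$, $(BM)(M^{-1}A) = BA$. The sketch below uses small hand-picked matrices with rank $r = 2$; the values and helper names are hypothetical and chosen only for illustration.

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [
        [sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def inv2(M):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


B = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]             # 3 x r
A = [[0.5, 1.0, -1.0, 2.0], [1.5, 0.0, 2.0, -0.5]]    # r x 4
M = [[2.0, 1.0], [1.0, 1.0]]                          # invertible r x r mixing matrix

delta = matmul(B, A)                                       # original weight update
delta_re = matmul(matmul(B, M), matmul(inv2(M), A))        # reparameterized factors
max_diff = max(
    abs(u - v) for row_u, row_v in zip(delta, delta_re) for u, v in zip(row_u, row_v)
)
print(max_diff)
```

Because the weight update is unchanged while the individual rank coordinates are mixed by $M$, no loss that depends only on the update can prefer one basis over another. This is the symmetry that the diagonal rank-wise posterior is said to break.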

2506.19417 2026-05-13 cs.LG cs.MA

Focusing Influence Mechanism for Multi-Agent Reinforcement Learning

Yisak Park, Sunwoo Lee, Seungyul Han

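Affiliations * Graduate School of Artificial Intelligence

AI summary Multi-agent reinforcement learning (MARL) struggles with coordinated exploration in sparse-reward environments. This paper proposes the Focusing Influence Mechanism (FIM), which uses an entropy-based criterion to steer agents toward under-explored regions of the state space and uses eligibility traces to keep multiple agents aligned on beneficial regions, improving the coordination and persistence of joint behavior. Experiments show that FIM improves cooperative performance across a range of MARL benchmarks, with particularly clear gains under sparse rewards.

Comments 9 technical pages followed by references and appendix

Details
English abstract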

Cooperative multi-agent reinforcement learning (MARL) under sparse rewards remains fundamentally challenging because agents often fail to concentrate their influence, leading to insufficiently coordinated exploration. To address this, we propose the Focusing Influence Mechanism (FIM), a framework that encourages agents to focus their influence on under-explored parts of the state space through an entropy-based criterion, while leveraging eligibility traces to enable multiple agents to consistently align and sustain their influence on the same parts of the state space when beneficial, thereby promoting coordinated and persistent joint behavior. By emphasizing under-explored regions of the state space, FIM facilitates more efficient and structured exploration even under extremely sparse rewards. Across diverse MARL benchmarks, FIM consistently improves cooperative performance over strong baselines.

2506.14097 2026-05-13 cs.RO cond-mat.soft physics.comp-ph

Smooth-Rigid-Body Contact as a ReLCP: A Recursively Generated Linear Complementarity Problem

Bryce Palmer, Hasan Metin Aktulga, Tong Gao

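Affiliations * Center for Computational Biology, Flatiron Institute; Department of Computer Science, Michigan State University; Department of Mechanical Engineering, Tufts University

AI summary This paper reformulates complementarity-based time-stepping for frictionless nonsmooth contact between smooth rigid bodies as a recursively generated linear complementarity problem (ReLCP), built up through a sequence of LCPs of increasing dimension. Starting from the classical single-constraint shared-normal signed-distance (SNSD) LCP, the method adds unilateral constraints only when the discrete-time update predicted by the current contact set would cause surface penetration. It therefore acts directly on smooth geometry, enforces nonpenetration, and avoids the oversampling that comes with proxy-surface models. Theoretical analysis shows finite termination and uniqueness of the velocity update for strictly convex bodies and sufficiently small timesteps, and numerical experiments confirm stability and efficiency at large timesteps.

Details
English abstract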

This paper reformulates complementarity-based time-stepping for frictionless nonsmooth contact between smooth rigid bodies as a recursively generated linear complementarity problem (ReLCP), involving a sequence of LCPs of increasing dimension. Starting from a classical single-constraint shared-normal signed-distance (SNSD) LCP, the method adds unilateral constraints only when the discrete-time update predicted by the current contact set would violate nonpenetration of the underlying smooth surfaces. The resulting procedure acts directly on smooth geometry, enforces nonpenetration to a prescribed tolerance, and avoids the oversampling inherent to proxy-surface contact models such as tessellations or multi-sphere decompositions, for which improved geometric fidelity can drive rapid growth in constraint count and cost. For strictly convex bodies, we prove that an initially overlap-free configuration and sufficiently small timestep sizes imply finite termination of the adaptive augmentation and yield a unique discrete-time velocity update. In the small timestep limit and for any fixed overlap-free discrete state with a fixed geometric overlap tolerance, we prove that the recursion terminates after the initial solve, reducing the method to the classical single-constraint SNSD LCP and retaining the usual consistency of complementarity time-stepping with the underlying differential variational inequality. Numerical tests on colliding ellipsoids, compacting ellipsoid suspensions, growing bacterial colonies, and taut chainmail networks demonstrate stable large-timestep behavior, bounded interpenetration without discretization-induced surface roughness, and substantial reductions in both active constraint counts and runtime relative to representative discrete-surface complementarity formulations.
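
As background for the abstract, a linear complementarity problem asks for $z \ge 0$ such that $w = Mz + q \ge 0$ and $z^\top w = 0$. The sketch below solves one small LCP with projected Gauss-Seidel, a generic textbook solver chosen here for brevity. The abstract does not state which solver the authors use, the recursive constraint generation is not reproduced, and the matrix, vector, and function name are hypothetical.

```python
def solve_lcp_pgs(M, q, sweeps=100):
    """Projected Gauss-Seidel for the LCP: z >= 0, w = M z + q >= 0, z . w = 0.

    Assumes M has a positive diagonal (e.g. symmetric positive definite).
    """
    n = len(q)
    z = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            residual = q[i] + sum(M[i][j] * z[j] for j in range(n))
            # Gauss-Seidel update on row i, projected onto z_i >= 0.
            z[i] = max(0.0, z[i] - residual / M[i][i])
    w = [q[i] + sum(M[i][j] * z[j] for j in range(n)) for i in range(n)]
    return z, w


# Two hypothetical constraints: the first is active, the second stays separated.
M = [[2.0, 1.0], [1.0, 2.0]]
q = [-1.0, 1.0]
z, w = solve_lcp_pgs(M, q)
print(z, w)
```

In a contact reading, each entry of `z` plays the role of a non-negative constraint impulse and the matching entry of `w` a non-negative post-step gap. Complementarity says that at most one of the two is nonzero for each constraint.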

2506.08902 2026-05-13 cs.LG cs.AI

Intention-Conditioned Flow Occupancy Models

Chongyi Zheng, Seohong Park, Sergey Levine, Benjamin Eysenbach

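Affiliations * Princeton University; University of California, Berkeley

AI summary This paper proposes intention-conditioned flow occupancy models (InFOM), a probabilistic model that predicts the distribution of states an agent will visit in the temporally distant future. The model is built with flow matching and includes a latent variable capturing user intention, which increases its expressivity and enables generalized policy improvement. Experiments on multiple benchmark tasks show a 1.8x median improvement in returns and a 36% increase in success rates over alternative methods.

Comments ICLR 2026

Details
English abstract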

Large-scale pre-training has fundamentally changed how machine learning research is done today: large foundation models are trained once, and then can be used by anyone in the community (including those without data or compute resources to train a model from scratch) to adapt and fine-tune to specific tasks. Applying this same framework to reinforcement learning (RL) is appealing because it offers compelling avenues for addressing core challenges in RL, including sample efficiency and robustness. However, there remains a fundamental challenge to pre-train large models in the context of RL: actions have long-term dependencies, so training a foundation model that reasons across time is important. Recent advances in generative AI have provided new tools for modeling highly complex distributions. In this paper, we build a probabilistic model to predict which states an agent will visit in the temporally distant future (i.e., an occupancy measure) using flow matching. As large datasets are often constructed by many distinct users performing distinct tasks, we include in our model a latent variable capturing the user intention. This intention increases the expressivity of our model, and enables adaptation with generalized policy improvement. We call our proposed method intention-conditioned flow occupancy models (InFOM). Compared with alternative methods for pre-training, our experiments on $36$ state-based and $4$ image-based benchmark tasks demonstrate that the proposed method achieves $1.8 \times$ median improvement in returns and increases success rates by $36\%$. Website: https://chongyi-zheng.github.io/infom Code: https://github.com/chongyi-zheng/infom

2506.02215 2026-05-13 cs.RO cs.SY eess.SY

Active inference as a unified model of collision avoidance behavior in human drivers

Julian F. Schumann, Johan Engström, Leif Johnson, Matthew O'Kelly, Joao Messias, Jens Kober, Arkady Zgonnikov

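Affiliations * Department of Cognitive Robotics; Delft University of Technology; Waymo LLC

AI summary This paper proposes a computational cognitive model based on active inference that gives a unified account of human drivers' decision-making in collision avoidance. The model simulates human responses by minimizing free energy in two typical collision scenarios, lead-vehicle braking and lateral incursion by an oncoming vehicle, and reproduces results from several earlier empirical studies, including response timing and choice of evasive maneuver. The study shows the potential of active inference as a unified framework for understanding human behavior in complex driving tasks.

Details
English abstract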

Collision avoidance, which involves rapid threat detection and quick execution of the appropriate evasive maneuver, is a critical aspect of driving. However, existing models of human collision avoidance behavior are fragmented, focusing on specific scenarios or only describing certain aspects of the avoidance behavior, such as response times. This paper addresses these gaps by proposing a novel computational cognitive model of human collision avoidance behavior based on active inference. Active inference provides a unified approach to modeling human behavior: the minimization of free energy. Building on prior active inference work, our model incorporates established cognitive mechanisms such as evidence accumulation to simulate human responses in two distinct collision avoidance scenarios: front-to-rear lead vehicle braking and lateral incursion by an oncoming vehicle. We demonstrate that our model explains a wide range of previous empirical findings on human collision avoidance behavior. Specifically, the model closely reproduces both aggregate results from meta-analyses previously reported in the literature and detailed, scenario-specific effects observed in a recent driving simulator study, including response timing, maneuver selection, and execution. Our results highlight the potential of active inference as a unified framework for understanding and modeling human behavior in complex real-life driving tasks.

2505.20761 2026-05-13 cs.LG stat.ML

Practical estimation of the optimal classification error with soft labels and calibration

Ryota Ushio, Takashi Ishida, Masashi Sugiyama

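Affiliations * The University of Tokyo; RIKEN AIP

AI summary This paper studies how to estimate the optimal classification error (the Bayes error) in binary classification in a way that is both practical and theoretically grounded. The authors extend an earlier soft-label-based method in two ways. First, they analyze the bias of the hard-label-based estimator, showing that its decay rate depends on how well the two class-conditional distributions are separated and that it can decay much faster than the previous result suggested as the number of hard labels per instance grows. Second, they address estimation with corrupted soft labels, showing that even calibrated soft labels can give inaccurate estimates, and propose an estimator based on isotonic calibration that is statistically consistent under weaker assumptions. The method is instance-free, which suits practical scenarios with privacy constraints. Experiments support the method's validity.

Comments ICLR 2026 camera-ready version updated; 40 pages, 12 figures; GitHub: https://github.com/RyotaUshio/bayes-error-estimation

Details
English abstract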

While the performance of machine learning systems has improved significantly in recent years, relatively little attention has been paid to the fundamental question: to what extent can we improve our models? This paper provides a practical and theoretically supported means of answering this question in the setting of binary classification. We extend a previous work that utilizes soft labels for estimating the Bayes error, the optimal error rate, in two important ways. First, we theoretically investigate the properties of the bias of the hard-label-based estimator discussed in the original work. We reveal that the decay rate of the bias is adaptive to how well the two class-conditional distributions are separated, and it can decay significantly faster than the previous result suggested as the number of hard labels per instance grows. Second, we tackle a more challenging problem setting: estimation with corrupted soft labels. One might be tempted to use calibrated soft labels instead of clean ones. However, we reveal that a calibration guarantee is not enough, that is, even perfectly calibrated soft labels can result in a substantially inaccurate estimate. Then, we show that isotonic calibration can provide a statistically consistent estimator under an assumption weaker than that of the previous work. Our method is instance-free, i.e., we do not assume access to any input instances. This feature allows it to be adopted in practical scenarios where the instances are not available due to privacy issues. Experiments with synthetic and real-world datasets show the validity of our methods and theory. The code is available at https://github.com/RyotaUshio/bayes-error-estimation.
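
The instance-free idea rests on a standard identity: in binary classification the Bayes error equals $\mathbb{E}[\min(\eta(x), 1-\eta(x))]$, where $\eta(x)$ is the probability of the positive class. Given clean soft labels, a plug-in estimate therefore needs only the labels, not the inputs. The sketch below shows that plug-in average on made-up soft labels; it is not taken from the authors' repository, it omits the isotonic calibration step for corrupted labels, and the function name is hypothetical.

```python
def bayes_error_from_soft_labels(soft_labels):
    """Plug-in estimate: average of min(p, 1 - p) over per-instance soft labels."""
    return sum(min(p, 1.0 - p) for p in soft_labels) / len(soft_labels)


# Hypothetical clean soft labels p_i = P(y = 1 | x_i); no inputs x_i are needed.
soft_labels = [0.9, 0.2, 0.5, 0.99, 0.7]
estimate = bayes_error_from_soft_labels(soft_labels)
print(estimate)
```

Confidently labeled instances (0.99) contribute almost nothing to the estimate, while ambiguous ones (0.5) contribute the maximum of one half. This is why the quality of the soft labels matters so much to the result.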

2505.19770 2026-05-13 cs.LG cs.CL

Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO

Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon S. Du

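Affiliations * University of Washington; The Hong Kong University of Science and Technology; Amazon Inc.

AI summary This paper gives a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attributing it to an explicit representation gap under exact optimization and an implicit representation gap under finite samples. Under exact optimization, the relative capacities of the reward and policy model classes affect final policy quality, and RLHF, DPO, or online DPO can each come out ahead depending on the type of model mis-specification. Under approximate optimization with a sparse ground-truth reward, RLHF needs fewer samples to recover an effective reward model, indicating a statistical advantage for two-stage learning in some settings. These results provide a theoretical basis for understanding the gap between RLHF and DPO and guidance on choosing between them in practice.

Comments ICML accepted version

Details
English abstract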

We present a fine-grained theoretical analysis of the performance gap between two-stage reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Our study decomposes this gap into two sources: the explicit representation gap under exact optimization and the implicit representation gap under finite samples. In the exact optimization setting, we characterize how the relative capacities of the reward and policy model classes influence the final policy quality. We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specification. Notably, online DPO can outperform both RLHF and standard DPO when the reward and policy model classes are isomorphic and both mis-specified. In the approximate optimization setting, we provide a concrete construction where the ground-truth reward is sparse and show that RLHF requires significantly fewer samples than DPO to recover an effective reward model, highlighting a statistical advantage of two-stage learning. Together, these results provide a comprehensive understanding of the performance gap between RLHF and DPO under various settings, and offer practical insights into when each method is preferred.
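
For context on what DPO optimizes directly, the standard per-pair DPO loss is $-\log \sigma\big(\beta[(\log \pi(y_w) - \log \pi_{\mathrm{ref}}(y_w)) - (\log \pi(y_l) - \log \pi_{\mathrm{ref}}(y_l))]\big)$ for a preferred response $y_w$ and a rejected response $y_l$. The sketch below evaluates that formula for scalar log-probabilities. It illustrates the well-known objective rather than anything specific to this paper's analysis, and the numeric inputs and the default $\beta$ are hypothetical.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin) = -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is zero and the loss is log(2).
loss_neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy raises the preferred response and lowers the rejected one: loss drops.
loss_better = dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(loss_neutral, loss_better)
```

The implicit reward here is $\beta$ times the log-ratio between policy and reference, so the reward model and the policy share one parameterization. Two-stage RLHF instead fits a separate reward model first, which is the structural difference the abstract's representation-gap analysis turns on.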