arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.24145 2026-05-13 cs.AI

OpsAgent: An Evolving Multi-agent System for Incident Management in Microservices

Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Dan Pei

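Affiliations Nankai University; Alibaba Cloud; Lenovo; Tsinghua University

AI summary OpsAgent is a lightweight, self-evolving multi-agent system for incident management in microservice systems. It uses a training-free data processor to convert heterogeneous observability data into structured textual descriptions, and pairs this with a multi-agent collaboration framework that makes diagnostic reasoning transparent and auditable. To support continual capability growth, OpsAgent introduces a dual self-evolution mechanism that combines internal model updates with external experience accumulation. Experiments show strong results in performance, interpretability, cost efficiency, and self-evolution, indicating that it is feasible to deploy and operate over the long term.

Abstract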

Incident management (IM) is central to the reliability of large-scale microservice systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces, is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world microservice systems. Notably, its deployment in Lenovo's production environment further validates its effectiveness in real-world industrial settings.

2510.09333 2026-05-13 cs.LG cs.CV

Efficient Bayesian Inference from Noisy Pairwise Comparisons

Till Aczel, Lucas Theis, Roger Wattenhofer

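Affiliations ETH Zurich, Switzerland

AI summary This paper studies how to perform efficient Bayesian inference from noisy human pairwise comparisons in order to evaluate the quality of generative models. The authors propose BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, filters out unreliable raters, and guarantees monotonic convergence of the likelihood through an Expectation-Maximization algorithm. Experiments show that BBQ yields more efficient, robust, and interpretable model rankings and uncertainty estimates with noisy or crowdsourced raters.

Abstract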

Evaluating generative models is challenging because standard metrics often fail to reflect human preferences. Human evaluations are more reliable but costly and noisy, as participants vary in expertise, attention, and diligence. Pairwise comparisons improve consistency, yet aggregating them into overall quality scores requires careful modeling. Bradley-Terry-based methods update item scores from comparisons, but existing approaches either ignore rater variability or lack convergence guarantees, limiting robustness and interpretability. We introduce BBQ, a Bayesian Bradley-Terry variant that explicitly models rater quality, downweighting or removing unreliable participants, and provides guaranteed monotonic likelihood convergence through an Expectation-Maximization algorithm. Empirical results show that BBQ provides efficient inference, well-calibrated uncertainty estimates, and more robust, interpretable rankings compared to baseline Bradley-Terry models, even with noisy or crowdsourced raters. This framework enables more reliable and cost-effective human evaluation of generative models.
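
For readers unfamiliar with the Bradley-Terry baseline that the abstract builds on, the following is a minimal sketch of the classic minorize-maximize iteration for the plain Bradley-Terry model. It is not BBQ: it has no prior, no rater-quality model, and no uncertainty estimate. The function name `fit_bradley_terry` and the toy comparison data are illustrative assumptions, and the sketch assumes every item wins at least once.

```python
def fit_bradley_terry(n_items, comparisons, iters=200):
    """Fit plain Bradley-Terry scores from (winner, loser) index pairs.

    Hypothetical helper for illustration only; assumes each item has
    at least one win so that no score collapses to zero.
    """
    wins = [0] * n_items
    pair_counts = {}  # unordered pair -> number of comparisons
    for winner, loser in comparisons:
        wins[winner] += 1
        key = (min(winner, loser), max(winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    scores = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            denom = 0.0
            for (a, b), count in pair_counts.items():
                if i == a or i == b:
                    j = b if i == a else a
                    denom += count / (scores[i] + scores[j])
            updated.append(wins[i] / denom if denom > 0 else scores[i])
        # Scores are only identified up to scale, so fix their sum.
        total = sum(updated)
        scores = [s * n_items / total for s in updated]
    return scores


# Toy data: item 0 usually beats 1 and 2, item 1 usually beats 2.
toy = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
       + [(0, 2)] * 3 + [(2, 0)])
print(fit_bradley_terry(3, toy))
```

Under the fitted model, item i is preferred over item j with probability scores[i] / (scores[i] + scores[j]); every comparison is weighted equally here, which is the limitation that rater-aware variants are meant to address.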

2510.03853 2026-05-13 cs.CV

UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

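Affiliations College of Computer Science and Artificial Intelligence, Fudan University; Zhejiang University

AI summary UGround proposes a unified visual grounding framework that dynamically selects intermediate layers across unrolled Transformer layers as "mask as prompt", addressing the conventional reliance on the fixed last hidden layer. The method introduces Policy-Prompted Masking, which comprises two core components, Stochastic Skip Connection and Mask as Prompt, to dynamically guide the vision model (e.g., SAM) and pass explicit spatial cues. Within a single framework, UGround covers a range of visual grounding tasks from an attribute perspective, including traditional referring expression segmentation and the newly proposed reasoning segmentation, improving the model's flexibility and applicability.

Comments This work has been accepted to ICML 2026, please refer to https://github.com/rui-qian/UGround

Abstract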

We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt," diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt." UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic layer selection at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, have unified visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.

2510.03206 2026-05-13 cs.AI cs.CL

Coevolutionary Continuous Discrete Diffusion: Make Your Diffusion Language Model a Latent Reasoner

Cai Zhou, Chenxiao Yang, Yi Hu, Chenyu Wang, Chubin Zhang, Muhan Zhang, Lester Mackey, Tommi Jaakkola, Stephen Bates, Dinghuai Zhang

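Affiliations Massachusetts Institute of Technology; Microsoft Research; Toyota Technological Institute at Chicago; Peking University; Tsinghua University

AI summary The paper examines the performance gap between diffusion language models in discrete and continuous spaces, noting that continuous diffusion models are more expressive in theory yet often underperform discrete models in practice. The authors propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint diffusion process over a continuous representation space and a discrete token space. This combines the strengths of both: it retains the semantic richness of the continuous space while using discrete tokens to improve training and sampling. Experiments show that CCDD performs strongly in language modeling on several real-world tasks.

Comments 29 pages. Accepted to ICML 2026

Abstract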

Diffusion language models, especially masked discrete diffusion models, have achieved great success recently. While there are some theoretical and preliminary empirical results showing the advantages of latent reasoning with looped transformers or continuous chain-of-thoughts, continuous diffusion models typically underperform their discrete counterparts. In this paper, we argue that diffusion language models do not necessarily need to be in the discrete space. In particular, we prove that continuous diffusion models have stronger expressivity than discrete diffusions and looped transformers. We attribute the contradiction between the theoretical expressiveness and empirical performance to their practical trainability: while continuous diffusion models provide intermediate supervision that looped transformers lack, they introduce additional difficulty in decoding tokens into the discrete token space from the continuous representation space. We therefore propose Coevolutionary Continuous Discrete Diffusion (CCDD), which defines a joint multimodal diffusion process on the union of a continuous representation space and a discrete token space, leveraging a single model to simultaneously denoise in the joint space. By combining two modalities, CCDD is expressive with rich semantics in the latent space, as well as good trainability and sample quality with the help of explicit discrete tokens. We also propose effective architectures and advanced training/sampling techniques for CCDD, which show strong empirical performance in extensive language modeling experiments on real-world tasks.

2510.02107 2026-05-13 cs.LG

PENEX: AdaBoost-Inspired Neural Network Regularization

Klaus-Rudolf Kladny, Bernhard Schölkopf, Michael Muehlebach

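Affiliations MPI for Intelligent Systems; Tübingen AI Center; ELLIS Institute Tübingen

AI summary This paper proposes PENEX, an AdaBoost-inspired regularization method for neural networks. It reformulates the multi-class exponential loss so that it is amenable to first-order optimization and therefore practical for neural network training. PENEX improves generalization by increasing the margins of data points, and in low-data regimes it matches and in some settings outperforms established regularizers. The study shows the potential of the exponential loss beyond AdaBoost.

Abstract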

AdaBoost sequentially fits so-called weak learners to minimize an exponential loss, which penalizes misclassified data points more severely than other loss functions like cross-entropy. Paradoxically, AdaBoost generalizes well in practice as the number of weak learners grows. In the present work, we introduce Penalized Exponential Loss (PENEX), a new formulation of the multi-class exponential loss that is theoretically grounded and, in contrast to the existing formulation, amenable to optimization via first-order methods, making it a practical objective for training neural networks. We demonstrate that PENEX effectively increases margins of data points, which can be translated into a generalization bound. Empirically, across computer vision and language tasks, PENEX improves neural network generalization in low-data regimes, matching and in some settings outperforming established regularizers at comparable computational cost. Our results highlight the potential of the exponential loss beyond its application in AdaBoost.
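
To make the first sentence of the abstract concrete, the sketch below compares the binary exponential loss used by AdaBoost with the binary cross-entropy (logistic) loss as a function of the margin m = y * f(x), with labels y in {-1, +1}. This is the standard binary case only; it is not the PENEX objective, whose multi-class formulation is not given in the abstract. The function names are illustrative.

```python
import math


def exponential_loss(margin):
    """AdaBoost-style exponential loss exp(-m) for margin m = y * f(x)."""
    return math.exp(-margin)


def logistic_loss(margin):
    """Binary cross-entropy written in terms of the margin: log(1 + exp(-m))."""
    return math.log1p(math.exp(-margin))


# Negative margins are misclassified points; positive margins are correct ones.
for m in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"margin={m:+.1f}  exp={exponential_loss(m):8.4f}  "
          f"logistic={logistic_loss(m):8.4f}")
```

For strongly negative margins the exponential loss grows exponentially while the logistic loss grows only about linearly, which is the harsher penalty on misclassified points that the abstract refers to.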

2510.00733 2026-05-13 cs.LG cs.AI q-bio.QM

Neural Diffusion Processes for Physically Interpretable Survival Prediction

Alessio Cristofoletto, Cesare Rollo, Giovanni Birolo, Piero Fariselli

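Affiliations Department of Computing Sciences, Bocconi University, Milano, Italy; Computational Biomedicine Unit, University of Torino, Torino, Italy

AI summary This paper proposes DeepFHT, a survival-analysis framework that combines deep neural networks with first hitting time (FHT) distributions from stochastic process theory, modeling the time to event as the first time a latent diffusion process reaches an absorbing boundary. A neural network maps input variables to physically meaningful parameters such as the initial condition, drift, and diffusion coefficient, which yields closed-form survival and hazard functions without assuming proportional hazards. Experiments show that DeepFHT's predictive performance is on par with existing state-of-the-art methods while retaining a physically interpretable parameterization that helps reveal the relation between input features and risk.

Comments 12 pages, 5 figures

Abstract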

We introduce DeepFHT, a survival-analysis framework that couples deep neural networks with first hitting time (FHT) distributions from stochastic process theory. Time to event is represented as the first passage of a latent diffusion process to an absorbing boundary. A neural network maps input variables to physically meaningful parameters including initial condition, drift, and diffusion, within a chosen FHT process such as Brownian motion, both with drift and driftless. This yields closed-form survival and hazard functions and captures time-varying risk without assuming proportional hazards. We compare DeepFHT with Cox regression using synthetic and real-world datasets. The method achieves predictive accuracy on par with the state-of-the-art approach, while maintaining a physics-based interpretable parameterization that elucidates the relation between input features and risk. This combination of stochastic process theory and deep learning provides a principled avenue for modeling survival phenomena in complex systems.

2509.25239 2026-05-13 cs.AI cs.CL cs.LG

A Formal Comparison Between Chain of Thought and Latent Thought

Kevin Xu, Issei Sato

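Affiliations Department of Computer Science, The University of Tokyo, Japan

AI summary This paper compares two reasoning approaches for large language models: chain of thought (CoT) and latent thought. CoT reasons by explicitly generating intermediate tokens, whereas latent thought computes directly in a continuous latent space, supporting operations beyond discrete linguistic representations. The study finds that latent thought is more efficient for parallel computation, while CoT supports approximate counting and sampling under stochastic decoding, providing a theoretical basis for choosing a reasoning paradigm for a given task.

Comments Camera-ready version for ICML 2026

Abstract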

Chain of thought (CoT) elicits reasoning in large language models by explicitly generating intermediate tokens. In contrast, latent thought reasoning operates directly in the continuous latent space, enabling computation beyond discrete linguistic representations. While both approaches exploit iterative computation, their comparative capabilities remain underexplored. In this work, we present a formal analysis showing that latent thought admits more efficient parallel computation than inherently sequential CoT. In contrast, CoT enables approximate counting and sampling through stochastic decoding. These separations suggest the tasks for which depth-driven recursion is more suitable, thereby offering practical guidance for choosing between reasoning paradigms.

2509.22414 2026-05-13 cs.CV

LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu

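Affiliations The Hong Kong University of Science and Technology (Guangzhou)

AI summary This paper proposes LucidFlux, a caption-free, high-fidelity image restoration method that adapts the large-scale diffusion transformer Flux.1 for photo-realistic restoration. It introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to anchor geometry and suppress artifacts, respectively, and designs a timestep- and layer-adaptive modulation schedule for coarse-to-fine, context-aware updates. In addition, SigLIP features provide caption-free semantic alignment, and a scalable data curation pipeline supplies the training data. LucidFlux outperforms existing open-source and commercial methods on multiple benchmarks, supporting robust image restoration without text prompts in complex scenes.

Comments Project Page: https://w2genai-lab.github.io/LucidFlux

Abstract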

Image restoration (IR) aims to recover images degraded by unknown mixtures while preserving semantics, conditions under which discriminative restorers and UNet-based diffusion priors often oversmooth, hallucinate, or drift. We present LucidFlux, a caption-free IR framework that adapts a large diffusion transformer (Flux.1) without image captions. Our LucidFlux introduces a lightweight dual-branch conditioner that injects signals from the degraded input and a lightly restored proxy to respectively anchor geometry and suppress artifacts. Then, a timestep- and layer-adaptive modulation schedule is designed to route these cues across the backbone's hierarchy, in order to yield coarse-to-fine and context-aware updates that protect the global structure while recovering texture. After that, to avoid the latency and instability of text prompts or Vision-Language Model (VLM) captions, we enforce caption-free semantic alignment via SigLIP features extracted from the proxy. A scalable curation pipeline further filters large-scale data for structure-rich supervision. Across synthetic and in-the-wild benchmarks, our LucidFlux consistently outperforms strong open-source and commercial baselines, and ablation studies verify the necessity of each component. LucidFlux shows that, for large DiTs, when, where, and what to condition on, rather than adding parameters or relying on text prompts, is the governing lever for robust and caption-free image restoration in the wild.

2509.20899 2026-05-13 cs.CV

Concepts in Motion: Temporal Concept Bottleneck Model for Interpretable Video Classification

Patrick Knab, Sascha Marton, Philipp J. Schubert, Drago Guggiana, Christian Bartelt

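Affiliations Technical University of Clausthal; Ramblr.ai Research

AI summary This paper proposes MoTIF, an interpretable video classification method that introduces a Transformer architecture operating on temporal concept activations to address the challenge of extracting and modeling concepts in video. It uses per-concept temporal self-attention to capture how concepts change over time and how they contribute to the classification, and a vision-language-model-based concept discovery module automatically extracts object- and action-related textual concepts from training videos without manual annotation. Experiments show that the method improves over global concept bottleneck models on multiple video benchmarks and remains competitive within the interpretable setting.

Abstract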

Concept Bottleneck Models (CBMs) enable interpretable image classification by structuring predictions around human-understandable concepts, but extending this paradigm to video remains challenging due to the difficulty of extracting concepts and modeling them over time. In this paper, we introduce MoTIF (Moving Temporal Interpretable Framework), a transformer-based concept architecture that operates on sequences of temporally grounded concept activations, by employing per-concept temporal self-attention to model when individual concepts recur and how their temporal patterns contribute to predictions. Central to the framework is a class-conditioned VLM-based concept discovery module that extracts object- and action-centric textual concepts from training videos, yielding temporally expressive concept sets without manual concept annotation. Across multiple video benchmarks, this combination improves over global concept bottlenecks and remains competitive within the interpretable concept-bottleneck setting, while narrowing the gap to strong black-box video baselines that we report as contextual references. Code available at github.com/patrick-knab/MoTIF.

2509.13548 2026-05-13 cs.SD eess.AS stat.ML

Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

Manan Mittal, Thomas Deppisch, Joseph Forrer, Chris Le Sueur, Zamir Ben-Hur, David Lou Alon, Daniel D. E. Wong

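Affiliations Stony Brook University; Chalmers University of Technology; Reality Labs Research, Meta

AI summary This paper proposes a new mixture-of-experts method for field-of-view enhanced binaural rendering of moving talkers. It combines multiple binaural filters online using implicit localization, enabling real-time tracking and enhancement of continuously moving sources, and it can emphasize or suppress sound from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on direction-of-arrival estimation or operate in the Ambisonics domain, this signal-dependent framework is agnostic to array geometry and suits spatial audio capture and personalized playback in next-generation consumer audio devices.

Comments 5 pages, 3 figures

Abstract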

We propose a novel mixture-of-experts framework for field-of-view enhancement in binaural signal matching. Our approach enables dynamic spatial audio rendering that adapts to continuous talker motion, allowing users to emphasize or suppress sounds from selected directions while preserving natural binaural cues. Unlike traditional methods that rely on explicit direction-of-arrival estimation or operate in the Ambisonics domain, our signal-dependent framework combines multiple binaural filters in an online manner using implicit localization. This allows for real-time tracking and enhancement of moving sound sources, supporting applications such as speech focus, noise reduction, and world-locked audio in augmented and virtual reality. The method is agnostic to array geometry, offering a flexible solution for spatial audio capture and personalized playback in next-generation consumer audio devices.

2507.16818 2026-05-13 cs.LG

Evaluating Artificial Intelligence Algorithms for the Standardization of Transtibial Prosthetic Socket Shape Design

C. H. E. Jordaan, M. van der Stelt, T. J. J. Maal, V. M. A. Stirler, R. Leijendekkers, T. Kachman, G. A. de Jong

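Affiliations 3D Lab, Radboud University Medical Centre; Department of Trauma Surgery, Radboud University Medical Centre; Department of Rehabilitation, Radboud University Medical Centre; Donders Centre for Cognition, Radboud University; Military Health Organisation, Ministry of Defence, Kromhout Kazerne

AI summary This study aims to use AI algorithms to standardize the design of transtibial prosthetic sockets and reduce reliance on the prosthetist's experience. Based on 3D residual-limb scans and corresponding socket models from 118 patients, it pre-processes the data with Morphable Models and Principal Component Analysis and develops three algorithms that predict either the socket shape or the prosthetist's adaptations. The results show that predicting the adaptations is more accurate than directly predicting the final shape; the random forest model performs best, with a median surface-to-surface distance of 1.24 millimeters.

Journal ref Computer Methods and Programs in Biomedicine Update 9: 2026

Abstract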

The quality of a transtibial prosthetic socket depends on the prosthetist's skills and expertise, as the fitting is performed manually. This study investigates multiple artificial intelligence (AI) approaches to help standardize transtibial prosthetic socket design. Data from 118 patients were collected by prosthetists working in the Dutch healthcare system. This data consists of a three-dimensional (3D) scan of the residual limb and a corresponding 3D model of the prosthetist-designed socket. Multiple data pre-processing steps are performed for alignment, standardization and optionally compression using Morphable Models and Principal Component Analysis. Afterward, three different algorithms - a 3D neural network, Feedforward neural network, and random forest - are developed to either predict 1) the final socket shape or 2) the adaptations performed by a prosthetist to predict the socket shape based on the 3D scan of the residual limb. Each algorithm's performance was evaluated by comparing the prosthetist-designed socket with the AI-generated socket, using two metrics in combination with the error location. First, we measure the surface-to-surface distance to assess the overall surface error between the AI-generated socket and the prosthetist-designed socket. Second, distance maps between the AI-generated and prosthetist sockets are utilized to analyze the error's location. For all algorithms, estimating the required adaptations outperformed direct prediction of the final socket shape. The random forest model applied to adaptation prediction yields the lowest error with a median surface-to-surface distance of 1.24 millimeters, a first quartile of 1.03 millimeters, and a third quartile of 1.54 millimeters.

2507.13841 2026-05-13 cs.CL

The Challenge and Reward of Fair Play in Narrative: A Computational Approach

Eitan Wagner, Renana Keydar, Omri Abend

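Affiliations Department of Computer Science, Hebrew University of Jerusalem; Department of Law and Digital Humanities, Hebrew University of Jerusalem

AI summary This paper studies the balance between surprise and coherence in narrative, proposing an information-theoretic framework with detective fiction as the case study. It finds that the two qualities trade off for a single reader model but can coexist once pre-revelation and post-revelation reading modes are distinguished. The paper further proposes fair play as an important criterion of narrative quality and validates it experimentally with large language models. The results show that achieving fair play is challenging for models and that surprise and coherence are not positively correlated across stories.

Comments 47 pages, 11 figures, 13 tables

Abstract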

Good storytelling involves surprise -- unpredictability in how the story unfolds -- and sense-making, the requirement that the story forms a coherent sequence. However, to date, these two qualities have largely been addressed in isolation. We formalize these qualities and their relationship in an information-theoretic framework, using detective fiction as a paradigm case of narratives in which a hidden truth is discovered through reasoning. Our central theoretical result shows that surprise and coherence must trade off for any *single* reader model, but can coexist when two reader modes are distinguished: a pre-revelation mode that forms expectations while the ending is unknown, and a post-resolution hindsight mode that re-evaluates the story after the culprit is revealed. The balance of these two dimensions is realized in the common requirement of *fair play*, giving the reader a chance to solve the mystery while maintaining a challenge. We operationalize the framework using large language models as simulated readers, and define reference-less evaluation metrics for surprise, coherence, and fair play. Experiments on LLM-generated stories validate our theoretical predictions: while models generally succeed in creating surprise or coherence, achieving fair play poses a challenge even for strong models. Moreover, surprise and coherence do not positively correlate across stories, resisting reduction to a single latent quality. A human study validates the metrics, confirming they capture aspects of narrative quality that matter to readers. Our metrics also reproduce established literary intuitions, finding Christie's stories more surprising and more fair-playing than Conan Doyle's.

2506.13163 2026-05-13 cs.LG

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Tanmay Goyal, Gaurav Sinha

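Affiliations Microsoft Research India

AI summary This paper studies the logistic contextual slate bandit problem, in which at each round an agent selects a slate of $N$ items from an exponentially large set of candidate slates and observes only a single binary reward determined by a logistic model. To maximize cumulative reward over $T$ rounds at low computational cost, the authors propose two efficient algorithms, Slate-GLM-OFU and Slate-GLM-TS, which achieve $N^{O(1)}$ per-round time complexity through local planning and low regret through global learning. Theoretical analysis and experiments show that the algorithms perform well across a range of synthetic settings and can be applied to selecting in-context examples for language models, achieving competitive test accuracy.

Comments Accepted to UAI 2025

Abstract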

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.
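
The idea of "local planning (independent slot selections)" can be sketched under two assumptions that are not stated in the abstract: that the candidate slates are the product of per-slot item sets, and that the logit of the reward is a sum of per-slot linear terms. Under those assumptions, picking the best item in each slot independently also maximizes the slate's success probability, because the sigmoid is monotone in the summed logit. The sketch is greedy with respect to a fixed parameter estimate; it omits the optimism bonus or posterior sampling that Slate-GLM-OFU and Slate-GLM-TS would add, and all names are hypothetical.

```python
import itertools
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_slate(theta_per_slot, candidates_per_slot):
    """Pick, for each slot independently, the item with the largest linear score."""
    return [
        max(range(len(items)), key=lambda j: dot(theta, items[j]))
        for theta, items in zip(theta_per_slot, candidates_per_slot)
    ]


def success_probability(theta_per_slot, candidates_per_slot, slate):
    """Logistic reward probability for a slate under the additive-logit assumption."""
    logit = sum(
        dot(theta, items[j])
        for theta, items, j in zip(theta_per_slot, candidates_per_slot, slate)
    )
    return 1.0 / (1.0 + math.exp(-logit))


def brute_force_slate(theta_per_slot, candidates_per_slot):
    """Exhaustive search over all slates; exponential in the number of slots."""
    index_ranges = [range(len(items)) for items in candidates_per_slot]
    return max(
        itertools.product(*index_ranges),
        key=lambda s: success_probability(theta_per_slot, candidates_per_slot, s),
    )


# Two slots with 2-dimensional item features (toy values).
theta = [[1.0, -0.5], [0.2, 0.8]]
candidates = [
    [[0.1, 0.9], [0.7, 0.2], [0.4, 0.4]],
    [[0.5, 0.1], [0.3, 0.6]],
]
print(select_slate(theta, candidates), brute_force_slate(theta, candidates))
```

The per-slot selection costs time linear in the total number of items, whereas the exhaustive search enumerates every slate; both return the same slate in this additive setting.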

2506.09044 2026-05-13 cs.LG

Strategically Deceptive Model Deployment in Performative Prediction

Javier Sanguino Bautiste, Thomas Kehrenberg, Jose A. Lozano, Novi Quadrianto

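Affiliations Basque Center for Applied Mathematics; University of the Basque Country UPV/EHU; University of Sussex

AI summary This paper studies strategically deceptive deployment in Performative Prediction, where an institution's decisions are governed by a model that differs from the model users respond to. It proposes a new framework, Decoupled Performative Prediction (DPP), to model the mismatch between the institution's decision model and the model shaping user responses, and proves that this can yield lower risk for the institution. The study also introduces the deception cost as a measure of how much users are deceived, and analyzes the limits of institutions incorporating this cost into optimization under reputational or user-abandonment pressure. It argues that model disclosure is not only an ethical matter but also a key technical design decision that calls for regulation.

Comments Accepted to FAccT 2026

Abstract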

Machine Learning systems are increasingly deployed in decision-making settings that shape user behavior and, in turn, the data on which future decisions are based. Performative Prediction (PP) formalizes this feedback loop by modeling how deployed models induce distributional shifts. It studies how to learn robust and well-performing models under such dynamics. However, existing PP frameworks typically assume that the model governing these decisions is the same model observed by users (and therefore the one to which they respond). In practice, deployer institutions may instead disclose curated models, while internally relying on distinct opaque models. We introduce Decoupled Performative Prediction (DPP), a framework that explicitly models mismatches between the model governing institutional decisions and the model that shapes user behavior. By analyzing the resulting optimization landscape, we show that DPP admits new, different solutions that provably achieve lower risk for the institution than those under classical PP. We further propose an algorithm with provable convergence guarantees under standard assumptions, demonstrating how easily institutions can benefit from strategically deceptive deployment when they control model disclosure and users lack countervailing power. To capture the implications of such behavior, we introduce the deception cost, a quantitative measure of the degree of deception experienced by users. We study settings in which institutions incorporate this cost into the optimization process, motivated by reputational concerns or potential user abandonment, and show that such self-imposed constraints are insufficient to protect users. Overall, our results demonstrate that model disclosure is not merely an ethical consideration but a core technical design decision, underscoring the need for regulations that hold institutions accountable for deceptive deployment practices.

2506.02084 2026-05-13 cs.LG stat.ML

Adversarial Causal Tuning for Realistic Time-series Generation

Nikolaos Gkorgkolis, Nikolaos Kougioulis, MingXue Wang, Bora Caglayan, Andrea Tonon, Dario Simionato, Ioannis Tsamardinos

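Affiliations University of Crete & FORTH; Huawei Ireland Research Centre

AI summary This paper studies how to generate simulated data that has the same observational and interventional distributions as a real time-series dataset, with the aim of building a probabilistic causal digital twin. The authors propose Adversarial Causal Tuning (ACT), which combines ideas from generative adversarial networks and AutoML to search for optimal causal models and discriminators that improve the match between generated and real data distributions, and uses permutation testing to control model complexity. Experiments show that ACT fits and generalizes well across multiple datasets, offering a new approach to generating realistic time series.

Comments 22 pages, 3 figures

Abstract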

We address the problem of generating simulated, yet realistic, time-series data from a causal model with the same observational and interventional distributions as a given real dataset (probabilistic causal digital twin). While non-causal models (e.g., GANs) also strive to simulate realistic data, causal models are fundamentally more powerful, able to simulate the effect of interventions (what-if scenarios), optimize decisions, and perform root-cause analysis and counterfactual causal reasoning. We introduce the Adversarial Causal Tuning (ACT) methodology, which outputs the optimal causal model that fits the data, along with a quantification of the goodness-of-fit. The returned causal model can then be employed to simulate new data or to perform other causal reasoning tasks. ACT adopts ideas from Generative Adversarial Network training and AutoML to search for optimal causal pipelines and discriminators that detect deviations between the distributions of real and simulated data. It also adapts a permutation testing procedure from established causal tuning methods to penalize models for complexity. Through extensive experiments on real, semi-synthetic, and synthetic datasets, we show that (a) employing multiple optimized discriminators is paramount for selecting the optimal causal models and quantifying goodness-of-fit, (b) ACT selects the optimal causal model in synthetic datasets while avoiding overfitting, generating data indistinguishable from the true data distribution, and (c) all state-of-the-art generative and causal simulation methods exhibit room for improvement in reproducing real data distributions; generating realistic temporal data is still an open research challenge.

2506.01568 2026-05-13 cs.LG cs.RO

Trajectory First: A Curriculum for Discovering Diverse Policies

Cornelius V. Braun, Sayantan Auddy, Marc Toussaint

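Affiliations TU Berlin; Robotics Institute

AI summary This paper proposes a two-stage curriculum that aims to increase the behavioral diversity of reinforcement learning agents. It first introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors, and then distills them into reactive, step-wise policies. Experiments show that the curriculum increases the diversity of learned skills while maintaining task performance.

Comments Accepted into the Inductive Biases in Reinforcement Learning Workshop at RLC 2025

Abstract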

Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has become a useful reinforcement learning (RL) framework for training a set of diverse agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robot manipulation, resulting in limited behavioral diversity. We address this with a two-stage curriculum that introduces a spline-based trajectory prior as an inductive bias to produce diverse, high-reward behaviors in an initial stage, and then distills these behaviors into reactive, step-wise policies in a second stage. In our empirical evaluation, we provide novel insights into challenges of diversity-targeted training and show that our curriculum increases the diversity of learned skills while maintaining high task performance.

2505.20535 2026-05-13 cs.LG

Rotary Masked Autoencoders are Versatile Learners

Uros Zivanovic, Serafina Di Gioia, Andre Scaffidi, Martín de los Rios, Gabriella Contardo, Roberto Trotta

Affiliations University of Trieste; Abdus Salam International Centre for Theoretical Physics (ICTP); Scuola Internazionale Superiore di Studi Avanzati (SISSA); University of Nova Gorica; INFN – National Institute for Nuclear Physics; ICSC - Centro Nazionale di Ricerca in High Performance Computing; Imperial College London

AI summary The paper proposes the Rotary Masked Autoencoder (RoMAE), which addresses the need for specialized architectures when applying Transformers to irregular time series. RoMAE incorporates Rotary Positional Embedding (RoPE), enabling interpolation and representation learning with multidimensional continuous positional information without any time-series-specific architecture. Experiments show that RoMAE performs well across modalities including irregular time series, images, and audio, surpassing specialized time-series models on difficult datasets while maintaining MAE's usual performance on other modalities.

Comments NeurIPS 2025 Final Camera Ready

Journal ref Advances in Neural Information Processing Systems 38, NeurIPS 2025, Pages 133952-133987

Abstract

Applying Transformers to irregular time-series typically requires specializations to their baseline architecture, which can result in additional computational overhead and increased method complexity. We present the Rotary Masked Autoencoder (RoMAE), which utilizes the popular Rotary Positional Embedding (RoPE) method for continuous positions. RoMAE is an extension to the Masked Autoencoder (MAE) that enables interpolation and representation learning with multidimensional continuous positional information while avoiding any time-series-specific architectural specializations. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio, demonstrating that RoMAE surpasses specialized time-series architectures on difficult datasets such as the DESC ELAsTiCC Challenge while maintaining MAE's usual performance across other modalities. In addition, we investigate RoMAE's ability to reconstruct the embedded continuous positions, demonstrating that including learned embeddings in the input sequence breaks RoPE's relative position property.
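
The relative position property mentioned at the end of the abstract can be illustrated with a minimal one-dimensional RoPE sketch evaluated at real-valued positions, such as irregular timestamps. This is the standard rotary formulation applied to a single vector, not RoMAE's implementation; the function name `rope_rotate` and the base value 10000 are assumptions for the example.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `position`.

    `position` may be any real number, e.g. an irregular timestamp.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # one frequency per pair
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * c - y * s, x * s + y * c])
    return rotated


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


query = [0.3, -1.2, 0.8, 0.5]
key = [1.1, 0.4, -0.7, 0.9]

# Attention scores at two pairs of positions that share the same offset.
score_a = dot(rope_rotate(query, 0.37), rope_rotate(key, 2.9))
score_b = dot(rope_rotate(query, 0.37 + 5.25), rope_rotate(key, 2.9 + 5.25))
print(score_a, score_b)
```

The two scores agree up to floating-point error because the rotated dot product depends only on the difference between the two positions; this is the property that the abstract reports is broken when learned embeddings are included in the input sequence.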

2505.18780 2026-05-13 cs.RO cs.LG

DreamPolicy: A Unified World-model Policy for Scalable Humanoid Locomotion

Yahao Fan, Tianxiang Gui, Kaiyang Ji, Shutong Ding, Chixuan Zhang, Yifeng Xu, Ke Yang, Jiayuan Gu, Jingyi Yu, Jingya Wang, Ye Shi

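Affiliations * ShanghaiTech University; InstAdapt

AI summary Achieving a humanoid locomotion policy that adapts to diverse terrains remains a key challenge. This paper presents DreamPolicy, a unified policy framework that combines offline data with a diffusion world model so that a single policy can master locomotion on both known and unseen terrains. A terrain-aware world model generates physically plausible future trajectories that serve as dynamic objectives for a conditioned policy, avoiding manual reward design. Experiments show that DreamPolicy outperforms the strongest existing methods on unseen and composite terrains, offering a scalable, data-driven paradigm for generalist humanoid control.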

Details
English abstract

Achieving versatile humanoid locomotion with a single policy presents a critical scalability challenge. Prevailing methods often rely on distilling multiple terrain-specific teacher policies into a unified student policy. However, while such distillation captures basic locomotion primitives, it struggles to organically compose these skills to adapt to complex environments, resulting in poor generalization to novel composite terrains unseen during training. To overcome this, we present DreamPolicy, a unified framework that integrates offline data with a diffusion-based world model, enabling a single policy to master both known and unseen terrains. Central to our approach is a terrain-aware world model, driven by an autoregressive diffusion world model trained on aggregated rollouts from specialized policies. This model synthesizes physically plausible future trajectories, which serve as dynamic objectives for a conditioned policy, thereby bypassing manual reward engineering. Unlike distillation, our world model captures generalizable locomotion skills, allowing for robust zero-shot transfer to unseen composite terrains. DreamPolicy naturally scales with data availability. As the offline dataset expands, the diffusion world model continuously acquires richer skills. Experiments demonstrate that DreamPolicy outperforms the strongest baseline by up to 27% on unseen terrains and 38% on combined terrains. By unifying world model-based planning and policy learning, DreamPolicy breaks the "one task, one policy" bottleneck and establishes a scalable, data-driven paradigm for generalist humanoid control.

2505.11356 2026-05-13 cs.LG

Fractal Graph Contrastive Learning

Nero Z. Li, Xuehao Zhai, Zhichao Shi, Boshen Shi, Xuhui Jiang

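Affiliations * CDT, University of Oxford; IDEA Research, International Digital Economy Academy; School of Advanced Interdisciplinary Sciences, UCAS; State Key Lab of AI Safety, Institute of Computing Technology, CAS; China Mobile Research Institute; DataArc Tech Ltd.

AI summary This paper proposes FractalGCL, a graph contrastive learning framework that addresses the limited control that conventional graph augmentations offer over global structural consistency. The method builds augmented graphs by renormalisation and introduces a fractal-dimension-aware contrastive loss to improve positive-view consistency and to reweight negative-pair repulsion. To reduce computational overhead, the authors also derive a Gaussian approximation that substantially improves runtime. Experiments show that FractalGCL performs well on several benchmark datasets and on real-world traffic tasks, with good pretraining and transfer ability.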

Comments 32 pages, 7 figures

Details
English abstract

Graph Contrastive Learning (GCL) relies on semantically consistent graph augmentations, but common local perturbations provide limited control over global structural consistency, motivating a more principled global augmentation strategy. We therefore propose Fractal Graph Contrastive Learning (FractalGCL), a theory-motivated framework that constructs a renormalisation-based augmented graph and introduces a fractal-dimension-aware contrastive loss that penalises unreliable positive views and reweights negative-pair repulsion by finite-scale box-counting discrepancies. However, computing these discrepancies introduces substantial overhead, so we derive and justify a Gaussian surrogate that avoids repeated box-counting on renormalised graphs, yielding about a 61% runtime reduction. Experiments show that FractalGCL serves as an effective frozen-pretraining tool on MalNet-Tiny, achieves strong performance on the standard TUDataset benchmarks, and outperforms the next-best method on real-world urban traffic tasks by 4.51 percentage points in average accuracy. Code is available at https://anonymous.4open.science/r/FractalGCL-0511/.

2505.10859 2026-05-13 cs.AI

Exploring Nonlinear Pathway in Parameter Space for Machine Unlearning

Yingdan Shi, Ren Wang

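Affiliations * Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL, USA

AI summary This paper studies how to effectively remove the influence of specific training data from a trained machine learning model and proposes a new framework, Mode Connectivity Unlearning (MCU). The method uses mode connectivity to find a nonlinear "unlearning pathway" in parameter space, and improves unlearning effectiveness and computational efficiency through a parameter mask strategy and adaptive adjustment of the penalty coefficient. Unlike traditional methods, MCU uncovers a series of models along the unlearning pathway and shows good generality and experimental performance.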

Details
English abstract

Machine Unlearning (MU) aims to remove the information of specific training data from a trained model, ensuring compliance with privacy regulations and user requests. While one line of existing MU methods relies on linear parameter updates via task arithmetic, these methods suffer from weight entanglement. In this work, we propose a novel MU framework called Mode Connectivity Unlearning (MCU) that leverages mode connectivity to find an unlearning pathway in a nonlinear manner. To further enhance performance and efficiency, we introduce a parameter mask strategy that not only improves unlearning effectiveness but also reduces computational overhead. Moreover, we propose an adaptive adjustment strategy for our unlearning penalty coefficient to adaptively balance forgetting quality and predictive performance during training, eliminating the need for empirical hyperparameter tuning. Unlike traditional MU methods that identify only a single unlearning model, MCU uncovers a spectrum of unlearning models along the pathway. Overall, MCU serves as a plug-and-play framework that seamlessly integrates with any existing MU method, consistently improving unlearning efficacy. Extensive experiments on the image classification task demonstrate that MCU achieves superior performance. Code is available at https://github.com/TIML-Group/Mode-Connectivity-Unlearning.
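To make the contrast between a linear update and a nonlinear pathway concrete, here is a rough sketch of a quadratic Bezier curve between two parameter vectors. This is a parameterisation commonly used in the mode-connectivity literature; the abstract does not state which curve MCU uses, so the curve choice, the function names, and all weight values below are assumptions.

```python
def bezier_point(w_start, w_end, theta, t):
    """Point at t in [0, 1] on a quadratic Bezier curve with control point `theta`."""
    return [
        (1 - t) ** 2 * a + 2 * t * (1 - t) * c + t ** 2 * b
        for a, b, c in zip(w_start, w_end, theta)
    ]

def linear_point(w_start, w_end, t):
    """Point at t on the straight line between the two endpoints, for comparison."""
    return [(1 - t) * a + t * b for a, b in zip(w_start, w_end)]

w_original = [1.0, -2.0, 0.5]   # hypothetical weights of the trained model
w_unlearned = [0.0, 1.0, 0.5]   # hypothetical weights of some unlearned model
theta = [2.0, 0.0, -1.0]        # hypothetical control point (would be learned in practice)

# Five models sampled along the curved pathway, endpoints included.
path = [bezier_point(w_original, w_unlearned, theta, k / 4) for k in range(5)]
midpoint_on_line = linear_point(w_original, w_unlearned, 0.5)
```

Every `t` gives a different candidate model, which is one way to read the abstract's "spectrum of unlearning models along the pathway"; the curve's midpoint differs from the straight-line midpoint whenever the control point is off the line.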

2505.02072 2026-05-13 cs.CL cs.AI

Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs

Eitan Wagner, Omri Abend

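Affiliations * Eitan Wagner Omri Abend

AI summary This paper argues that the recent shift of language models from modeling distributions over strings to serving as prediction models for general-purpose tasks raises overlooked problems when large language models are used as probability estimators, especially for world probabilities. The authors stress that distribution estimation and response prediction are theoretically distinct, and that current approaches based on token logprobs can yield conflicting output distributions across use cases, creating pitfalls in probabilistic interpretation. The paper advocates second-order prediction, in which probabilities are stated explicitly in the output, as a more theoretically sound approach to probabilistic modeling.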

Comments Accepted to ICML 2026 (position track)

Details
English abstract

Language modeling has shifted in recent years from a distribution over strings to prediction models with textual inputs and outputs for general-purpose tasks. This position paper highlights the often overlooked implications of this shift for the use of large language models (LLMs) as probability estimators, especially for world probabilities. In light of the theoretical distinction between distribution estimation and response prediction, we examine LLM training phases and common use cases for LLM output probabilities. We show that the different settings lead to distinct, potentially conflicting, desired output distributions. This lack of clarity leads to pitfalls when using output probabilities as event probabilities. Our position advocates for second-order prediction -- incorporating probabilities explicitly as part of the output -- as a theoretically sound method, in contrast to using token logprobs. We conclude with suggestions for potential directions to improve the probabilistic soundness of this method.

2504.12326 2026-05-13 cs.CL cs.AI cs.LG

Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis

Shahriar Noroozizadeh, Jeremy C. Weiss

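Affiliations * Machine Learning Department and Heinz College, Carnegie Mellon University; National Library of Medicine, National Institutes of Health

AI summary This study reconstructs the clinical trajectories of sepsis patients from clinical case reports, using large language models (LLMs) to extract clinical findings from unstructured text and annotate them in time. It builds an open-access textual time series corpus for sepsis comprising 2,139 PubMed open-access case reports, and validates the models against expert annotations, reporting high accuracy in temporal localization and event identification. The work demonstrates the ability of LLMs to reconstruct timelines from clinical text, notes their limitations, and suggests improvements such as multimodal integration.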

Comments Conference on Health, Inference, and Learning (CHIL 2026)

Details
English abstract

Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: GPT-5: 0.93, Llama 3.3 70B Instruct: 0.76) and strong temporal ordering (concordance: GPT-5: 0.965, Llama 3.3 70B Instruct: 0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
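A temporal-ordering concordance of the kind reported above can be sketched as the share of event pairs whose predicted order matches the reference order. The exact definition used in the paper is not given in the abstract, so the pairwise rule below (ties in the prediction count one half, ties in the reference are skipped) and the toy timestamps are assumptions.

```python
from itertools import combinations

def concordance(reference_times, predicted_times):
    """Fraction of comparable event pairs ordered the same way in both timelines."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(reference_times)), 2):
        ref_diff = reference_times[i] - reference_times[j]
        if ref_diff == 0:
            continue  # tied reference times carry no ordering information
        comparable += 1
        pred_diff = predicted_times[i] - predicted_times[j]
        if pred_diff == 0:
            concordant += 0.5  # predicted tie: half credit
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")

# Hypothetical event times in hours relative to admission.
reference = [-48, -24, 0, 12, 72]
predicted = [-72, -24, 0, 24, 12]
score = concordance(reference, predicted)
```

In this toy case only the last two events are swapped, so 9 of the 10 pairs are concordant and `score` is 0.9.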

2503.06139 2026-05-13 cs.CL

GRP: Goal-Reversed Prompting for Zero-Shot Evaluation with LLMs

Mingyang Song, Mao Zheng, Xuan Luo

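Affiliations * Tencent

AI summary This paper proposes Goal-Reversed Prompting (GRP), a method for improving zero-shot evaluation with large language models (LLMs). The judge model is asked to identify the worse of two candidate answers, and the preference is then determined by elimination, which improves evaluation accuracy. Experiments show that GRP significantly improves judge performance across several evaluation tasks, with the most pronounced gains on reasoning and mathematics; the method is compatible with a variety of prompt templates and requires no extra inference rounds.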

Comments Ongoing Work

Details
English abstract

Pairwise LLM-as-a-judge evaluation asks the judge to identify the better of two candidate answers. We study a one-line modification that asks for the worse answer instead and recovers the preference by elimination, a procedure we call Goal-Reversed Prompting (GRP). GRP introduces no extra inference rounds, composes with any prompt template (direct, chain-of-thought, or Arena-Hard SOP), and leaves the rest of the evaluation pipeline untouched. Two observations motivate the reversal. Reverse reasoning is a recurring strategy in human problem solving, and modern instruction-tuned judges exhibit a positive-leaning bias that asking for the worse answer can counteract. On JudgeBench under a strict consistency protocol that counts a judgment as correct only when both response orderings agree with the gold preference, GRP improves all three closed-source judges we test across both response-pair sources. With GPT-4o-generated pairs, the Arena-Hard SOP baseline improves from 61.71% to 66.23% for GPT-4o (+4.52) and from 60.00% to 66.00% for Claude-3.5-Sonnet (+6.00), with the largest absolute gains on Reasoning and Mathematics. The lift persists when response pairs come from Claude-3.5-Sonnet and when the SOP scaffolding is stripped to a minimal direct-prompting template, suggesting that goal reversal acts on the underlying judging behavior rather than on a particular rubric. Stronger judges benefit more than weaker ones, suggesting that goal reversal exposes additional reasoning capacity rather than compensating for its absence.
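The elimination step and the strict consistency protocol described above can be sketched in a few lines. The judge here is a plain Python callable standing in for an LLM call; its interface (returning "A" or "B" for the worse of the two answers shown) and the two toy judges are hypothetical and are not taken from the paper.

```python
def grp_preference(judge_worse, question, first, second):
    """Ask which shown answer is worse ("A" = first, "B" = second); prefer the other."""
    worse = judge_worse(question, first, second)
    return second if worse == "A" else first

def strictly_consistent(judge_worse, question, ans_x, ans_y, gold):
    """Correct only if both presentation orders end up preferring the gold answer."""
    forward = grp_preference(judge_worse, question, ans_x, ans_y)
    backward = grp_preference(judge_worse, question, ans_y, ans_x)
    return forward == gold and backward == gold

def length_judge(question, first, second):
    # Toy stand-in for an LLM judge: calls the shorter answer the worse one.
    return "A" if len(first) < len(second) else "B"

def position_biased_judge(question, first, second):
    # Toy judge that always calls the second answer worse, whatever its content.
    return "B"

question = "What is 2 + 2?"
short_answer = "4"
full_answer = "2 + 2 = 4, since adding two to two gives four."

ok_consistent = strictly_consistent(length_judge, question, short_answer, full_answer, gold=full_answer)
ok_biased = strictly_consistent(position_biased_judge, question, short_answer, full_answer, gold=full_answer)
```

A judge driven purely by answer position fails the protocol, because swapping the order flips its preference; only a judgment that survives both orderings is counted.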

2502.11981 2026-05-13 cs.LG cs.AI cs.CY

Welfare as a Guiding Principle for Machine Learning -- From Compass, to Lens, to Roadmap

Nir Rosenfeld, Haifeng Xu

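Affiliations * Faculty of Computer Science, Technion – Israel Institute of Technology; Department of Computer Science, University of Chicago

AI summary This paper proposes social welfare as a core guiding principle in the design and application of machine learning, with the aim of maximizing social well-being. Drawing on welfare economics' theory of resource allocation, the authors argue that in social settings machine learning models should pursue not only predictive accuracy but also attend to their effect on overall social benefit. The paper advocates welfare as a fourth core criterion alongside optimization, generalization, and expressivity, offering new directions and evaluation criteria for both theory and practice.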

Details
English abstract

Decades of research in machine learning have given us powerful tools for making accurate predictions. But when used in social settings and on human inputs, better accuracy does not immediately translate to better social outcomes. To effectively promote social well-being through machine learning, this position article advocates for the wide adoption of social welfare as a guiding principle. The field of welfare economics asks: how should we allocate limited resources to self-interested agents in a way that maximizes social benefit? We argue that this perspective applies to many modern applications of machine learning in social contexts. As such, we propose that welfare serves as an additional core criterion in the design, study, and use of learning algorithms, complementing the conventional pillars of optimization, generalization, and expressivity, and as a compass guiding both theory and practice.

2502.03061 2026-05-13 cs.LG

Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

Mohammad Shahverdikondori, Amir Mohammad Abouei, Alireza Rezaeimoghadam, Negar Kiyavash

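Affiliations * EPFL; Sharif University of Technology

AI summary This paper studies a new variant of best arm identification (BAI) in a stochastic multi-armed bandit environment with post-action context. After each action the learner receives, in addition to the reward, extra context information that can aid decision making. The paper analyzes two types of post-action context and proposes corresponding algorithms, proves that they asymptotically achieve the optimal sample complexity, and shows experimentally that exploiting the context significantly improves performance.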

Comments 46 pages, 8 figures

Details
English abstract

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.

2501.19403 2026-05-13 cs.LG cs.AI

Tackling Fake Forgetting through Uncertainty Quantification

Yingdan Shi, Sijia Liu, Kaize Ding, Ren Wang

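Affiliations * Illinois Institute of Technology; Michigan State University; Northwestern University

AI summary This paper studies "fake forgetting" in machine unlearning: a model performs well on unlearning metrics yet still retains the ground-truth label information of the forgotten data. To address this, the authors propose CR, a new evaluation metric based on conformal prediction that measures forgetting quality more reliably, and further design CPU, an unlearning framework that incorporates conformal prediction and effectively improves forgetting. Experiments show superior forgetting performance on image classification tasks.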

Details
English abstract

Machine unlearning seeks to remove the influence of specified data from a trained model. While the unlearning accuracy provides a widely used metric for assessing unlearning performance, it falls short in assessing the reliability of forgetting. In this paper, we find that the forgetting data points misclassified by unlearning accuracy still have their ground truth labels included in the conformal prediction set from the uncertainty quantification perspective, leading to a phenomenon we term fake forgetting. To address this issue, we propose a novel metric CR, inspired by conformal prediction, that offers a more reliable assessment of forgetting quality. Building on these insights, we further propose an unlearning framework CPU that incorporates conformal prediction into the Carlini & Wagner adversarial attack loss, enabling the ground truth label to be effectively removed from the conformal prediction set. Through extensive experiments on image classification tasks, we demonstrate both the effectiveness of our proposed metric and the superior forgetting quality achieved by our framework. Code is available at https://github.com/TIML-Group/Conformal-Prediction-Unlearning.
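The fake forgetting phenomenon can be illustrated with a small split conformal prediction sketch: a forgotten sample is misclassified by its top-1 prediction, yet its true label still falls inside the conformal prediction set. The nonconformity score (one minus the probability of the label), the calibration values, and the sample probabilities are assumptions for illustration; the abstract does not specify the paper's CR metric or CPU loss, and this is not either of them.

```python
import math

def conformal_threshold(true_class_probs, alpha):
    """Split conformal quantile of the scores 1 - p(true label) on calibration data."""
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return scores[min(rank, n) - 1]  # clamp to the largest score for small n

def prediction_set(probs, threshold):
    """All labels whose nonconformity score does not exceed the threshold."""
    return {label for label, p in enumerate(probs) if 1.0 - p <= threshold}

# Hypothetical probabilities assigned to the true class on nine calibration points.
calibration = [0.9, 0.8, 0.7, 0.6, 0.5, 0.45, 0.4, 0.35, 0.3]
threshold = conformal_threshold(calibration, alpha=0.1)

# Hypothetical forgotten sample after unlearning; its true label is 1.
forget_probs = [0.55, 0.40, 0.05]
true_label = 1
top1 = max(range(len(forget_probs)), key=lambda c: forget_probs[c])
conformal_set = prediction_set(forget_probs, threshold)
```

The top-1 prediction is class 0, so an accuracy-based metric counts the sample as forgotten, while `conformal_set` is `{0, 1}` and still contains the true label.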

2501.16931 2026-05-13 cs.LG stat.AP

Beyond Point Estimates: Distributional Uncertainty in Machine Learning Performance Evaluation

Christoph Lehmann, Yahor Paromau

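Affiliations * Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI)

AI summary This paper proposes a distribution-based approach to evaluating machine learning models, treating performance metrics as random variables rather than fixed values so as to reflect the uncertainty of the training process more fully. It analyzes the empirical distribution of performance metrics, using quantiles and confidence intervals for point and interval estimation, with particular attention to the feasibility of statistical inference at small sample sizes. Compared with conventional mean-based evaluation, the approach characterizes the variability and uncertainty of model performance in finer detail, suits applications where reliability matters, and is easy to implement and apply broadly.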

Comments 21 pages, 9 figures

Details
English abstract

Machine learning models are often evaluated using point estimates of performance metrics such as accuracy, F1 score, or mean squared error. Such summaries fail to capture the inherent variability induced by stochastic elements of the training process, including data splitting, initialization, and hyperparameter optimization. This work proposes a distributional perspective on model evaluation by treating performance metrics as random quantities rather than fixed values. Instead of focusing solely on aggregate measures, empirical distributions of performance metrics are analyzed using quantiles and corresponding confidence intervals. The study investigates point and interval estimation of quantiles based on real-data use cases for classification and regression tasks, complemented by simulation studies for validation. Special emphasis is placed on small sample sizes, reflecting practical constraints in machine learning, where repeated training is computationally expensive. The results show that meaningful statistical inference on the underlying performance distribution is feasible even with sample sizes in the range of 10-25, while standard nonparametric confidence intervals remain applicable under these conditions. The proposed approach provides a more detailed characterization of variability and uncertainty compared to mean-based evaluation and enables a more differentiated comparison of models. In particular, it supports a risk-oriented interpretation of model performance, which is relevant in applications where reliability is critical. The presented methods are easy to implement and broadly applicable, making them a practical extension to standard performance evaluation procedures in machine learning.
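A standard nonparametric confidence interval for a quantile uses order statistics, with coverage given by a binomial sum. The sketch below applies the textbook construction to the median of 15 repeated runs; whether the paper uses exactly this variant is not stated in the abstract, and the function names and accuracy values are made up for illustration.

```python
from math import comb

def coverage(n, p, lower_rank, upper_rank):
    """P(X_(lower) <= q_p <= X_(upper)) for continuous i.i.d. data, 1-based ranks."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(lower_rank, upper_rank))

def median_interval_ranks(n, level=0.95):
    """Narrowest symmetric rank pair (l, n + 1 - l) whose coverage reaches `level`."""
    best = None
    for lower in range(1, n // 2 + 1):
        upper = n + 1 - lower
        if coverage(n, 0.5, lower, upper) >= level:
            best = (lower, upper)  # coverage shrinks as `lower` grows, so keep the last hit
    return best

# Hypothetical test accuracies from 15 training runs with different seeds.
accuracies = [0.912, 0.905, 0.921, 0.899, 0.917, 0.908, 0.914, 0.903,
              0.919, 0.910, 0.907, 0.916, 0.901, 0.913, 0.909]
ranks = median_interval_ranks(len(accuracies))
ordered = sorted(accuracies)
interval = (ordered[ranks[0] - 1], ordered[ranks[1] - 1])
```

For n = 15 the ranks come out as (4, 12), with exact coverage of about 0.965. For very small samples no symmetric pair reaches 95% (for example n = 5), and the function returns `None`.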

2501.08083 2026-05-13 cs.CV

Benchmarking Vision Foundation Models for Input Monitoring in Autonomous Driving

Mert Keser, Halil Ibrahim Orhan, Niki Amini-Naieni, Gesina Schwalbe, Alois Knoll, Matthias Rottmann

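Affiliations * Continental AG; Technical University of Munich; University of Lübeck; University of Oxford; University of Wuppertal

AI summary This paper studies how vision foundation models (VFMs) can be used for input monitoring in complex open-world domains such as autonomous driving, to detect out-of-distribution (OOD) scenarios that fall outside the training data. The authors propose an unsupervised, model-agnostic method that combines VFMs as feature extractors with density modeling techniques to detect semantic and covariate shifts in a unified way. Experiments show that the method outperforms existing OOD classification methods under a range of conditions and identifies high-risk inputs likely to cause downstream errors, suggesting a new approach to safety monitoring in complex vision tasks.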

Details
English abstract

Deep neural networks (DNNs) remain challenged by distribution shifts in complex open-world domains like automated driving (AD): Robustness against yet unknown novel objects (semantic shift) or styles like lighting conditions (covariate shift) cannot be guaranteed. Hence, reliable operation-time monitors for identification of out-of-training-data-distribution (OOD) scenarios are imperative. Current approaches for OOD classification are untested for complex domains like AD, are limited in the kinds of shifts they detect, or even require supervision with OOD samples. To prepare for unanticipated shifts, we instead establish a framework around a principled, unsupervised and model-agnostic method that unifies detection of semantic and covariate shifts: Find a full model of the training data's feature distribution, to then use its density at new points as in-distribution (ID) score. To implement this, we propose to combine Vision Foundation Models (VFMs) as feature extractors with density modeling techniques. Through a comprehensive benchmark of 4 VFMs with different backbone architectures and 5 density-modeling techniques against established baselines, we provide the first systematic evaluation of OOD classification capabilities of VFMs across diverse conditions. A comparison with state-of-the-art binary OOD classification methods reveals that VFM embeddings with density estimation outperform existing approaches in identifying OOD inputs. Additionally, we show that our method detects high-risk inputs likely to cause errors in downstream tasks, thereby improving overall performance. Overall, VFMs, when coupled with robust density modeling techniques, are promising to realize model-agnostic, unsupervised, reliable safety monitors in complex vision tasks.
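The core idea, fitting a density to training features and using its value at a new point as the ID score, can be sketched with the simplest possible density model. A diagonal Gaussian is used here purely for brevity and may not be among the five techniques benchmarked in the paper; the 2-D feature vectors are hypothetical stand-ins for VFM embeddings.

```python
import math

def fit_diagonal_gaussian(features):
    """Per-dimension mean and variance of the training features."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6 for d in range(dim)]
    return mean, var

def log_density(x, mean, var):
    """Log-density under the fitted diagonal Gaussian, used as the ID score."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - 0.5 * (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

# Hypothetical 2-D embeddings of training images.
train_features = [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1], [0.1, 0.9], [-0.2, 1.2]]
mean, var = fit_diagonal_gaussian(train_features)

id_score = log_density([0.05, 0.95], mean, var)   # close to the training data
ood_score = log_density([3.0, -2.0], mean, var)   # far from the training data
```

An input would be flagged as OOD when its score falls below a threshold chosen on held-out in-distribution data; the threshold selection is left out of this sketch.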

2501.02955 2026-05-13 cs.CV

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

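Affiliations * Tsinghua University; Zhipu AI

AI summary Vision language models (VLMs) have made notable progress in video understanding in recent years, but fine-grained motion understanding has not been studied systematically. This paper proposes MotionBench, a comprehensive benchmark for evaluating the fine-grained motion understanding of video models, comprising six categories of motion-oriented questions and video data from multiple sources. Experiments show that existing VLMs perform poorly on fine-grained motion understanding. The authors review architectures for video feature compression and propose an efficient Through-Encoder fusion method that improves motion perception, while showing that substantial room for improvement remains.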

Comments 20 pages

Details
English abstract

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLMs' ability to perceive fine-grained motion within the limited sequence length of the LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io.

2412.18594 2026-05-13 cs.LG stat.ML

Local and Mixing-Based Algorithms for Gaussian Graphical Model Selection from Glauber Dynamics

Vignesh Tirukkonda, Anirudh Rayas, Gautam Dasarathy

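Affiliations * Arizona State University

AI summary This paper studies structure learning for Gaussian graphical models when the data are dependent samples from Gaussian Glauber dynamics. The authors propose two complementary approaches. The first is a local edge-testing algorithm based on a correlation test, which can run in parallel without waiting for the chain to mix. The second applies when a Dobrushin contraction condition holds: subsampling the Gaussian Gibbs trajectory yields samples close in total variation distance to i.i.d. samples, so standard i.i.d. graphical model learners can be used directly. The paper also gives finite-sample recovery guarantees and analyzes information-theoretic lower bounds on the observation time.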

Comments Major revision. Corrects the earlier local ratio-estimator analysis by replacing it with a local product estimator; adds a burn-in/thinning estimator based on total-variation decoupling for Gaussian Gibbs samplers; strengthens the lower bounds; adds experiments; and compares with the related ICML 2026 work of Shen, Wu, Majid, and Moitra

Details
English abstract

Gaussian graphical model selection is usually studied under independent sampling, but in many applications observations arise from dependent dynamics. We study structure learning when the data consist of a single trajectory of Gaussian Glauber dynamics. We develop two complementary approaches. The first is a local edge-testing estimator based on an appropriately designed correlation test that reveals edges. This estimator does not require waiting for the chain to mix and admits an embarrassingly parallel edgewise implementation. The second is a burn-in/thinning reduction: under a Dobrushin contraction condition, we prove that a suitably subsampled Gaussian Gibbs trajectory is close in total variation to an i.i.d. product sample, allowing standard i.i.d. Gaussian graphical model learners to be used as black boxes. The key technical ingredient, which may be of independent interest, is a high-dimensional total-variation bound for random-scan Gaussian Gibbs samplers, obtained by combining Wasserstein contraction with an approximate Lipschitz smoothing argument. We prove finite-sample recovery guarantees for both approaches, establish information-theoretic lower bounds on the observation time, and empirically compare the resulting sample-computation tradeoffs.
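The burn-in/thinning reduction can be pictured with a small random-scan Gibbs sampler for a zero-mean Gaussian with a given precision matrix. The coordinate update is the standard Gaussian conditional; the precision matrix, the burn-in length, and the thinning interval below are arbitrary illustrative values, not the ones the paper derives from the Dobrushin condition.

```python
import math
import random

def gibbs_trajectory(precision, steps, burn_in, thin, seed=0):
    """Random-scan Gibbs sampler; keeps every `thin`-th state after `burn_in` steps."""
    rng = random.Random(seed)
    dim = len(precision)
    x = [0.0] * dim
    kept = []
    for t in range(steps):
        i = rng.randrange(dim)  # pick one coordinate uniformly at random
        # Conditional of x_i given the rest: mean -sum_j Theta_ij x_j / Theta_ii, variance 1 / Theta_ii.
        cond_mean = -sum(precision[i][j] * x[j] for j in range(dim) if j != i) / precision[i][i]
        cond_std = math.sqrt(1.0 / precision[i][i])
        x[i] = rng.gauss(cond_mean, cond_std)
        if t >= burn_in and (t - burn_in) % thin == 0:
            kept.append(list(x))
    return kept

# Hypothetical precision matrix of a 3-node chain graph (edges 0-1 and 1-2).
precision = [[2.0, -0.8, 0.0],
             [-0.8, 2.0, -0.8],
             [0.0, -0.8, 2.0]]
samples = gibbs_trajectory(precision, steps=1000, burn_in=200, thin=50)
```

The 16 retained states could then be passed to any i.i.d. Gaussian graphical model learner, which is the black-box use the abstract describes; how large the burn-in and thinning must be for this to be justified is exactly what the paper's total-variation bound addresses.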