arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.21132 2026-05-21 cs.CV

SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary

Jingyi He, Yue Zhou, Long Bai, Kun Yuan, Nassir Navab, Yuan Bi

Affiliations: Computer Aided Medical Procedures (CAMP), TU Munich, Germany; Munich Center for Machine Learning (MCML), Munich, Germany; University of Strasbourg, France; The Chinese University of Hong Kong, Hong Kong

AI summary: This work proposes SurgOnAir, a streaming vision-language model trained on a hierarchical dataset to understand the surgical workflow at multiple levels in real time and generate commentary, improving immediate responsiveness during surgery.

Abstract

Understanding surgical workflow in real time is fundamental for intelligent surgical embodiment, where AI systems continuously perceive and respond as surgery proceeds. In the operating room, critical decisions depend on subtle, moment-to-moment changes, such as fine instrument movements and evolving tissue states, where even slight perceptual delays can limit assistance or compromise safety. Yet existing methods remain offline or operate at coarse temporal scales, generating descriptions only after processing clips, preventing immediate reaction. We address this by proposing SurgOnAir, a streaming vision-language model that processes frames sequentially without future access and progressively generates narration tokens as visual input arrives. SurgOnAir achieves fine-grained frame-to-token generation, enabling instant responsiveness to evolving surgical dynamics. Built upon our curated hierarchical dataset SurgOnAir-11k spanning action-, step-, and phase-level supervision, the model is trained to produce multi-level textual responses that reflect the inherent hierarchy of surgical procedures. Furthermore, special transition tokens are generated to explicitly mark state changes, allowing SurgOnAir to capture and signal key workflow transitions as they occur. Experiments show that SurgOnAir enables real-time understanding through a single vision-language model that unifies streaming across multiple hierarchies of the surgical workflow, generating superior and hierarchy-aware narrations. Code and dataset will be public.

2605.21131 2026-05-21 cs.CV

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Affiliations: Intelligent Transportation Thrust of the Systems Hub, The Hong Kong University of Science and Technology (GZ), Guangzhou, P.R. China; The National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University, Xi’an, P.R. China; Applied Science, Amazon.com, Inc., USA

AI summary: This paper proposes UniT, which uses a Group Autoregressive Transformer to unify several geometry perception capabilities (online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation), and introduces a scale-adaptive geometry loss to improve metric-scale generalization across scenes.

Comments: Submitted to IEEE T-PAMI

Abstract

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, its essential capabilities remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
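
Illustrative sketch (not from the paper): the abstract describes online mode as many autoregressive steps over single-frame groups, offline mode as one step over a multi-frame group, and a queue-style KV cache that keeps memory bounded. The Python below mimics only that bookkeeping. The class `QueueKVCache`, the function `run_stream`, and the use of raw frames as stand-in keys and values are assumptions made for illustration; a real model would compute keys and values in its attention layers.

```python
from collections import deque


class QueueKVCache:
    """Bounded FIFO cache of per-group key/value entries (illustrative only)."""

    def __init__(self, max_groups):
        # A deque with maxlen silently drops the oldest entry on append,
        # which gives the "discard outdated memory on the fly" behaviour.
        self.entries = deque(maxlen=max_groups)

    def append(self, keys, values):
        self.entries.append((keys, values))

    def context(self):
        # Entries still available to attend to, oldest first.
        return list(self.entries)


def run_stream(frames, group_size, max_groups):
    """Split frames into groups and push one group per autoregressive step."""
    cache = QueueKVCache(max_groups)
    steps = 0
    for start in range(0, len(frames), group_size):
        group = frames[start:start + group_size]
        # Placeholder keys/values: the frames themselves stand in for features.
        cache.append(keys=group, values=group)
        steps += 1
    return steps, cache.context()


# Online mode: single-frame groups, many steps, memory capped at 4 groups.
online_steps, online_ctx = run_stream(list(range(10)), group_size=1, max_groups=4)
# Offline mode: one multi-frame group, a single forward step.
offline_steps, offline_ctx = run_stream(list(range(10)), group_size=10, max_groups=4)
```

Changing only `group_size` switches between the two modes, while `max_groups` bounds memory regardless of stream length.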

2605.21130 2026-05-21 cs.CV

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

Shibei Meng, Binxin Yang, Yuan Liu, Jiexuan Zhang, Zhengyao Lv, Hubery Yin, Qiang Xu

Affiliations: The Chinese University of Hong Kong; WeChat Vision, Tencent Inc.; Beijing Normal University; Peking University; The University of Hong Kong

AI summary: This paper proposes VersusQ, a pairwise margin reasoning framework that compares videos directly to alleviate absolute-scale calibration bias and achieve cross-domain video quality assessment.

Abstract

Large Multimodal Models (LMMs) have shown promise for video quality assessment, but most methods still predict an absolute score for each video. Such pointwise supervision often mixes perceptual quality with dataset-specific calibration, including annotation protocols, rating habits, and score distributions. As a result, the learned scoring rule may work well within a benchmark but transfer poorly across unseen domains. We argue that relative comparisons alleviate the absolute-scale calibration bias by focusing purely on perceptual differences rather than dataset-specific rating habits. Consequently, we propose VersusQ, a pairwise margin reasoning framework driven entirely by direct comparisons. Specifically, VersusQ performs LMM-based comparison between two videos, reasons about their visual and temporal quality differences, and predicts a signed continuous margin that captures both the preferred choice and the degree of difference. Furthermore, to align interpretable comparison rationales with fine-grained numerical differences, we introduce Margin-Coupled GRPO, which jointly optimizes rollout-based relational reasoning and continuous margin regression. Extensive experiments on multiple public VQA benchmarks demonstrate that VersusQ achieves state-of-the-art performance, strong cross-domain generalization, and reliable fine-grained ranking under heterogeneous evaluation scenarios.

2605.21127 2026-05-21 cs.LG

Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning

Lukas Twist, Helen Yannakoudakis, Jie M. Zhang

Affiliations: King’s College London

AI summary: This paper studies the loss of explicit reasoning during fine-tuning. It proposes a structural evaluation framework that separates answer correctness from reasoning-trace validity, and finds that standard supervised fine-tuning rapidly suppresses valid reasoning traces while answer-only metrics obscure the problem.

Comments: 22 pages, 3 tables, 3 figures

Abstract

Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a reasoning model in the first place. We introduce a structural evaluation framework that separates answer correctness from reasoning-trace validity, measuring valid, empty, missing, and truncated reasoning alongside reasoning-conditioned task performance. Using this framework, we study four open-weight reasoning models and find that standard supervised fine-tuning can rapidly suppress valid reasoning traces, and that answer-only metrics can substantially obscure this failure: in several settings, performance conditional on valid reasoning remains high while the rate of valid reasoning falls sharply. We further show that simple loss-masking strategies can substantially mitigate collapse without requiring teacher-generated reasoning traces. These results suggest that evaluations of fine-tuned reasoning models should report structural reasoning reliability metrics in addition to final-answer performance, especially when adaptation data does not contain explicit reasoning traces.
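
Illustrative sketch (not from the paper): the abstract separates answer correctness from reasoning-trace validity and counts valid, empty, missing, and truncated traces. The Python below shows one way such a structural check could look. The `<think>` and `</think>` delimiters, the function names, and the exact category rules are assumptions; the paper's own definitions may differ.

```python
from collections import Counter

# Assumed trace delimiters; real models may use different markers.
OPEN_TAG, CLOSE_TAG = "<think>", "</think>"


def classify_trace(output):
    """Label one model output as valid, empty, missing, or truncated."""
    if OPEN_TAG not in output:
        return "missing"        # no reasoning block at all
    after_open = output.split(OPEN_TAG, 1)[1]
    if CLOSE_TAG not in after_open:
        return "truncated"      # block opened but never closed
    trace = after_open.split(CLOSE_TAG, 1)[0]
    return "valid" if trace.strip() else "empty"


def evaluate(records):
    """records: non-empty list of (model_output, is_correct) pairs."""
    if not records:
        raise ValueError("records must not be empty")
    labels = [classify_trace(out) for out, _ in records]
    counts = Counter(labels)
    n = len(records)
    valid_correct = [ok for (_, ok), lab in zip(records, labels) if lab == "valid"]
    return {
        # Answer-only metric: ignores whether a trace exists.
        "answer_accuracy": sum(ok for _, ok in records) / n,
        # Structural metric: share of outputs with a well-formed trace.
        "valid_rate": counts["valid"] / n,
        # Performance conditional on valid reasoning.
        "accuracy_given_valid": (
            sum(valid_correct) / len(valid_correct) if valid_correct else None
        ),
        "counts": dict(counts),
    }
```

Reporting `valid_rate` next to `answer_accuracy` exposes the case the abstract warns about: accuracy conditional on valid reasoning can stay high while the share of valid traces drops.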

2605.21123 2026-05-21 cs.CV cs.LG

Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models

Kesong Li, Yixuan Xu, Kuo-kun Tseng, Weiyi Lu, Kan Liu, Tao Lan

Affiliations: School of Computer Science and Technology, Harbin Institute of Technology; Alibaba Group

AI summary: This paper proposes Linear-DPO. It derives a generalized DPO objective covering both diffusion and flow-matching via a unified reverse-time SDE framework, shows that the standard DPO objective is suboptimal for text-to-image generation, and validates the method on diffusion and flow-matching models through qualitative and quantitative experiments.

Comments: Code and models are available at https://github.com/Whynot0101/Linear-DPO. Work done during an internship at Alibaba Group.

Abstract

Direct Preference Optimization (DPO) is successful for alignment in LLMs but still faces challenges in text-to-image generation. Existing studies are confined to denoising diffusion models while overlooking flow-matching, and suffer from an objective mismatch when applying discrete NLP-based DPO to regression-based generative tasks. In this paper, we derive a generalized DPO objective that covers both diffusion and flow-matching via a unified reverse-time SDE framework, and point out from a gradient perspective that the standard DPO objective is suboptimal for text-to-image generation. Consequently, we propose Linear-DPO, which replaces the aggressive sigmoid-based utility function with a sustained linear utility and incorporates an EMA-updated reference model. Qualitative and quantitative experiments on diffusion models (SD1.5, SDXL) and a flow-matching model (SD3-Medium) demonstrate the superiority of our approach over existing baselines.
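
Illustrative sketch (not from the paper): one plausible reading of the "gradient perspective" is that the standard DPO loss, -log sigmoid(m) on a preference margin m, has a gradient magnitude sigmoid(-m) that vanishes once the margin is large, whereas a linear utility keeps a constant gradient. The Python below only compares these two scalar weights. The function names are hypothetical, and the actual Linear-DPO objective is not given in the abstract.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_grad_weight(margin):
    """Magnitude of d/dm [-log(sigmoid(m))], which equals sigmoid(-m)."""
    return sigmoid(-margin)


def linear_grad_weight(margin):
    """Magnitude of d/dm [-m]: constant, independent of the margin."""
    return 1.0


# Gradient weights at a few margins (assumed values, for illustration).
weights = [(m, dpo_grad_weight(m), linear_grad_weight(m)) for m in (0.0, 2.0, 5.0)]
```

Under this reading, the sigmoid utility stops pushing once a pair is separated, while the linear utility keeps a sustained signal.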

2605.21121 2026-05-21 cs.CV cs.GR

ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation

Hanxiao Sun, Mingxin Yang, Shuhui Yang, Zebin He, Xintong Han, Hongbo Fu, Chunchao Guo, Wenhan Luo

Affiliations: The Hong Kong University of Science and Technology; Tencent Hunyuan

AI summary: This paper proposes ROAR-3D, which upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A view router and a dual-stream attention design enable efficient multi-view 3D generation, improving generation quality and supporting test-time view scaling.

Abstract

Single-image-to-3D generative models can now produce high-quality geometry, yet conditioning on a single view inevitably introduces ambiguity about unseen regions. Multi-view conditioning can reduce this ambiguity, but existing methods either require fixed canonical viewpoints or rely on external reconstruction modules that impose heavy training costs and limit generation quality. We observe that pretrained single-view models already possess strong 2D-to-3D grounding that can be reused for multi-view conditioning. However, a closer analysis reveals that their conditioning mechanism entangles orientation control with geometry transfer, two functions that conflict when images from different viewpoints are naively combined. Based on this analysis, we propose ROAR-3D, a lightweight method that upgrades a pretrained single-view model to accept an arbitrary number of unposed images. A token-wise view router assigns each 3D latent token to its most relevant view, implicitly establishing 2D-to-3D correspondences without explicit pose input. A dual-stream attention design preserves the pretrained primary-view behavior while routing auxiliary views through a separate path dedicated to geometric enrichment. An orientation perturbation strategy ensures the auxiliary path learns orientation-independent geometry transfer. These components introduce minimal trainable parameters and add negligible inference overhead relative to the single-view baseline. ROAR-3D achieves state-of-the-art multi-view 3D generation quality and supports test-time view scaling from 1 to 12+ views with consistent improvements.

2605.21114 2026-05-21 cs.LG

A Unified Framework for Uncertainty-Aware Explainable Artificial Intelligence: A Case Study in Power Quality Disturbance Classification

Yinsong Chen, Samson S. Yu, Zhong Li, Chee Peng Lim

Affiliations: School of Engineering, Deakin University; Faculty of Mathematics and Computer Science, FernUniversität in Hagen; Department of Computing Technologies, Swinburne University of Technology

AI summary: This paper proposes a unified framework for uncertainty-aware explainable AI. In a power quality disturbance classification task, it uses Bayesian neural networks to capture the variability of the explanation distribution and support uncertainty-aware decision-making.

Abstract

Post-hoc explainable AI (XAI) methods typically produce deterministic attribution maps, whereas Bayesian neural networks (BNNs) induce a distribution over explanations. Capturing the variability of this distribution is important for uncertainty-aware decision-making. This paper formalises the explanation distribution as the push-forward measure of the BNN posterior through any Lipschitz-continuous attribution operator. It further proposes the uncertainty-aware relevance attribution operator (UA-RAO), a general family of operators that summarises the explanation distribution using the mean, variance, coefficient of variation, quantiles, and set-theoretic aggregation measures. Theoretical support is provided through Monte Carlo accessibility and Wasserstein approximation bounds. The framework is evaluated on a 15-class power quality disturbance (PQD) classification benchmark, comparing three BNN approximations paired with three attribution operators using relevance mass accuracy and intersection-over-union as localisation metrics. Results show that deep ensembles with the mean UA-RAO improve localisation over the deterministic baseline, while other UA-RAO summaries reveal uncertainty patterns absent from point-estimate attributions. Qualitative results on measured signals further suggest that these patterns generalise beyond the synthetic training distribution. The framework is domain-agnostic and can be applied to any BNN paired with a Lipschitz-continuous attribution operator.
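
Illustrative sketch (not from the paper): given attribution vectors computed from several posterior samples (for example ensemble members), the summaries named in the abstract can be approximated by simple Monte Carlo statistics. The Python below computes a per-feature mean, variance, coefficient of variation, and median (one quantile), plus a set-theoretic intersection and union of each sample's top-k features. The function name `summarise`, the top-k rule, and the `eps` guard are assumptions and not the paper's definitions.

```python
import statistics


def summarise(attributions, top_k=1, eps=1e-12):
    """attributions: list of S attribution vectors, each of length D."""
    per_feature = list(zip(*attributions))  # D tuples holding S samples each
    mean = [statistics.fmean(v) for v in per_feature]
    var = [statistics.pvariance(v) for v in per_feature]
    # Coefficient of variation: standard deviation relative to |mean|.
    cov = [(s2 ** 0.5) / (abs(m) + eps) for m, s2 in zip(mean, var)]
    median = [statistics.median(v) for v in per_feature]
    # Indices of the top-k attributed features for each posterior sample.
    top_sets = [
        set(sorted(range(len(a)), key=lambda i: a[i], reverse=True)[:top_k])
        for a in attributions
    ]
    return {
        "mean": mean,
        "variance": var,
        "cov": cov,
        "median": median,
        "intersection": set.intersection(*top_sets),  # relevant in every sample
        "union": set.union(*top_sets),                # relevant in any sample
    }
```

A feature that appears in the union but not the intersection is one whose relevance depends on the posterior sample, which a single point-estimate attribution map cannot show.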

2605.21112 2026-05-21 cs.CV

RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding

Weiyi Xiong, Bing Zhu

Affiliations: School of Automation Science and Electrical Engineering, Beihang University

AI summary: This paper proposes RCGDet3D, which relies on enhanced radar feature encoding rather than complex multi-modal fusion strategies to achieve higher accuracy at real-time speed in 3D object detection, setting a new benchmark for real-time deployment.

Abstract

4D automotive radar is indispensable for autonomous driving due to its low cost and robustness, yet its point cloud sparsity challenges 3D object detection. Existing 4D radar-camera fusion methods focus on complex fusion strategies, trading inference speed for marginal gains. This trade-off hinders real-time deployment due to heavy computation on dense feature maps. In contrast, feature extraction from sparse radar points is less time-consuming but remains under-explored. This work uncovers that simply enhancing radar feature extraction can achieve comparable or even higher performance than elaborate fusion modules, while maintaining real-time performance. Based on this finding, we propose RCGDet3D, which centers on radar feature encoding and simplifies multi-modal fusion. Its encoder inherits from the efficient Gaussian Splatting-based Point Gaussian Encoder (PGE) in RadarGaussianDet3D with two key improvements. First, the Ray-centric PGE (R-PGE) predicts Gaussian attributes in ray-aligned coordinate systems before unifying them to Bird's-Eye View (BEV) space, significantly improving geometric consistency and reducing learning difficulty by decoupling the coordinate transformation from representation learning. Second, a Semantic Injection (SI) module incorporates visual cues from images, producing more geometrically accurate and semantically enriched radar features. Experiments on View-of-Delft (VoD) and TJ4DRadSet show that RCGDet3D outperforms state-of-the-art methods in both accuracy and speed, setting a new benchmark for real-time deployment.

2605.21111 2026-05-21 cs.RO cs.SY eess.SY

Benchmarking Empirical and Learning-Based Approaches for Feedforward Steering Control in Autonomous Racing

Georg Jank, Mattia Piccinini, Sebastian Wenk, Phillip Pitschi, Johannes Betz, Boris Lohmann

Affiliations: Chair of Automatic Control, Department of Engineering Physics and Computation, Technical University of Munich; Professorship of Autonomous Vehicle Systems (AVS), Department of Mobility Systems Engineering, Technical University of Munich

AI summary: This paper systematically benchmarks two learning-based and two empirical feedforward steering controllers. The learning-based controllers achieve the lowest prediction errors in open-loop evaluation, but in closed-loop tests their path tracking performance and lap times do not surpass the proposed EHD approach, which shows that feedforward strategies need to be evaluated within the complete trajectory planning and control software stack.

Comments: 8 pages, 12 figures. Accepted for publication at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026), Naples, Italy, September 15-18, 2026

Abstract

Feedforward steering control is a key component of hierarchical control architectures for autonomous racing. The goal is to reduce steering corrections from the feedback controllers by predicting the vehicle's inverse lateral dynamics. This paper presents a systematic benchmark of two learning-based and two empirical (analytical) feedforward steering controllers. We introduce a new EHD formulation based on a polynomial surface fit that captures velocity-dependent nonlinear steering behavior with minimal parametrization. We test the feedforward controllers in a high-fidelity simulation framework based on the real-world Abu Dhabi Autonomous Racing League competition, using a high-fidelity double-track vehicle dynamics simulator. Open-loop evaluation shows that the learning-based controllers achieve the lowest prediction errors; however, closed-loop testing reveals that this improved accuracy does not translate into superior path tracking performance or lap times, even after iterative fine-tuning. In contrast, the proposed EHD approach achieves the best overall closed-loop robustness and lap time, highlighting the necessity of evaluating feedforward strategies within the complete trajectory planning and control software stack. Our code is available at https://github.com/TUMRT/steering_ff_control.

2605.21109 2026-05-21 cs.RO

Anomaly-Informed Confidence Calibration for Vision-Based Safety Prediction

Zhenjiang Mao, Jiawen Wu, Gabriel Wagner, Zhongzheng Zhang, Ivan Ruchkin

Affiliations: Trustworthy Engineered Autonomy (TEA) Lab, Department of Electrical and Computer Engineering, University of Florida

AI summary: This paper proposes an anomaly-informed online calibration method that fuses perceptual and dynamics anomaly scores to improve confidence estimates in vision-based safety prediction, reducing overconfidence under distribution shift.

Abstract

Reliable confidence estimates are important for safely deploying vision-based controllers in autonomous racing, where safety predictions must be derived from camera images, yet modern predictors become dangerously overconfident under test-time distribution shifts. We identify a critical perception-dynamics gap in existing anomaly signals: widely used scores, such as autoencoder reconstruction error, capture visual corruptions but miss dynamics anomalies (e.g., actuation bias, latency), where images remain plausible while the trajectory degrades. To address this, we propose an Anomaly-Informed Online Calibration approach that, without retraining any model component, fuses two complementary anomaly scores extracted from a world model: a perceptual score from reconstruction error and a dynamics score from epistemic uncertainty and control-stream statistics. Based on these fused scores, a lightweight temperature-scaling calibrator leverages test-time augmentation to selectively reduce overconfidence under shift while preserving nominal-condition performance. Experiments on a physical DonkeyCar under four real-world anomaly protocols unseen during training (darkness, blur, actuation bias, processing latency) reduce average expected calibration error from 0.184 to 0.116, a 37% improvement over the best baseline, without modifying the base safety predictor.
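
Illustrative sketch (not from the paper): the abstract combines temperature scaling with anomaly scores and reports expected calibration error (ECE). The Python below shows a standard binned ECE, a binary temperature-scaled confidence, and a hypothetical mapping from a fused anomaly score to a temperature. The mapping `anomaly_temperature`, its `base` and `gain` parameters, and the 10-bin choice are assumptions; the paper's calibrator is not specified in the abstract.

```python
import math


def temperature_scale(logit, temperature):
    """Binary confidence from a logit softened by a temperature T > 0."""
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def anomaly_temperature(anomaly_score, base=1.0, gain=2.0):
    """Hypothetical rule: a higher anomaly score gives a higher temperature."""
    return base + gain * max(0.0, anomaly_score)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


def relative_reduction(before, after):
    return (before - after) / before


# The reported figures are arithmetically consistent: 0.184 -> 0.116 is about 37%.
reported = relative_reduction(0.184, 0.116)
```

With this rule, a nominal input (score 0) keeps temperature 1 and its confidence unchanged, while an anomalous input is pulled toward 0.5, which is the selective softening the abstract describes.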

2605.21107 2026-05-21 cs.LG stat.ML

Improved Guarantees for Constrained Online Convex Optimization via Self-Contraction

Dhruv Sarkar, Abhishek Sinha

Affiliations: Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur; School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India

AI summary: This paper proposes a projection-based algorithm that simultaneously achieves O(log T) regret and O(log T) CCV for strongly convex losses; for convex losses it improves the CCV to O(√T) while maintaining the optimal O(√T) regret.

Abstract

We consider Constrained Online Convex Optimization (COCO) with adversarially chosen constraints. At each round, the learner chooses an action before observing the loss and constraint function for that round. The goal is to achieve small static regret against the best point satisfying all constraints while also controlling cumulative constraint violation ($\mathsf{CCV}$). For strongly convex losses, state-of-the-art algorithms achieve $O(\log T)$ regret and $O(\sqrt{T \log T})$ $\mathsf{CCV}$. The corresponding best-known bounds for convex losses are $O(\sqrt{T})$ regret and $O(\sqrt{T} \log T)$ $\mathsf{CCV}$. In this paper, we give a simple projection-based algorithm that simultaneously achieves $O(\log T)$ regret and $O(\log T)$ $\mathsf{CCV}$ for strongly convex losses, yielding an exponential improvement in the $\mathsf{CCV}$. For convex losses, our algorithm improves the $\mathsf{CCV}$ to $O(\sqrt{T})$ while maintaining the optimal $O(\sqrt{T})$ regret. The key to our improvement is a recent geometric result for self-contracted curves, which may be of independent interest.
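
Illustrative sketch (not the paper's algorithm): the COCO protocol and its two metrics can be shown with a generic projected online gradient descent in one dimension. The Python below assumes losses f_t(x) = (x - a_t)^2, which are 2-strongly convex, and constraints g_t(x) = x - b_t <= 0, with the learner projecting onto the intersection of the constraints revealed so far. The loss family, the step size 1/(mu t), and all names are assumptions chosen to keep the example small.

```python
def run_coco(targets, bounds, mu=2.0):
    """Play T rounds; return (cumulative loss, cumulative constraint violation)."""
    x = 0.0                      # initial action
    upper = float("inf")         # intersection of revealed constraints: x <= upper
    total_loss = 0.0
    ccv = 0.0
    for t, (a, b) in enumerate(zip(targets, bounds), start=1):
        # The action x was fixed before this round's loss and constraint are seen.
        total_loss += (x - a) ** 2
        ccv += max(0.0, x - b)   # violation of g_t(x) = x - b <= 0
        upper = min(upper, b)
        grad = 2.0 * (x - a)
        # Gradient step with step size 1/(mu * t), then projection onto x <= upper.
        x = min(x - grad / (mu * t), upper)
    return total_loss, ccv


def best_fixed_loss(targets, bounds):
    """Loss of the best single point that satisfies every constraint."""
    x_star = min(sum(targets) / len(targets), min(bounds))
    return sum((x_star - a) ** 2 for a in targets)


loss, ccv = run_coco([1.0] * 100, [0.5] * 100)
regret = loss - best_fixed_loss([1.0] * 100, [0.5] * 100)
```

Static regret compares the learner's cumulative loss against `best_fixed_loss`, while CCV accumulates only the positive part of each round's constraint value.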

2605.21104 2026-05-21 cs.LG

HORST: Composing Optimizer Geometries for Sparse Transformer Training

Tom Jacobs, Rohan Jain, Rebekka Burkholz

Affiliations: CISPA Helmholtz Center for Information Security

AI summary: This paper proposes HORST, a modular optimizer that composes optimizer geometries and introduces an L1 sparsity bias through a hyperbolic mirror map, promoting sparsity while preserving training stability.

Comments: 22 pages, 8 figures

Abstract

Sparsifying transformers remains a fundamental challenge, as standard optimizers fail to simultaneously encourage sparsity and maintain training stability. Effective adaptive optimizers exhibit an implicit $L_{\infty}$ bias favoring stability, yet sparsity requires an $L_1$ bias. To integrate sparsity, we propose a composition of optimizer steps, which we cast as non-commutative operators to analyze and combine their optimization geometry in a principled way. This yields HORST (Hyperbolic Operator for Robust Sparse Training), a modular optimizer that inherits stability from adaptive methods while inducing $L_1$ sparsity bias through a hyperbolic mirror map. Our experiments demonstrate its utility for sparse training of transformers on both vision and language tasks. HORST consistently and significantly outperforms AdamW baselines across all sparsity levels, with large gains at higher sparsity.
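
Illustrative sketch (not HORST itself): the abstract does not specify its hyperbolic mirror map, so the Python below uses a generic hyperbolic-entropy style mirror step as a stand-in. The weight is mapped to a dual space with asinh, a plain gradient step is taken there, and sinh maps it back. The function name and the `beta` scale parameter are assumptions. To first order the effective step is lr * sqrt(w^2 + beta^2), so with small `beta` large weights move roughly in proportion to their size while near-zero weights barely move, which is the kind of behaviour associated with an L1-like bias.

```python
import math


def hyperbolic_mirror_step(w, grad, lr, beta):
    """One mirror-descent step on a scalar weight under an asinh/sinh mirror map."""
    theta = math.asinh(w / beta)      # map the weight to the dual (mirror) space
    theta -= lr * grad                # ordinary gradient step in the dual space
    return beta * math.sinh(theta)    # map back to weight space


# Same gradient, different weight magnitudes, small beta (assumed values).
moved_large = abs(1.0 - hyperbolic_mirror_step(1.0, grad=1.0, lr=0.1, beta=1e-3))
moved_small = abs(0.01 - hyperbolic_mirror_step(0.01, grad=1.0, lr=0.1, beta=1e-3))
```

How HORST composes such a step with an adaptive optimizer is the paper's contribution and is not reproduced here.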

2605.21103 2026-05-21 cs.LG

A Typed Tensor Language for Federated Learning

Theofilos Mailis, Kalliopi-Christina Despotidou, Konstantinos Filippopolitis, Yannis Foufoulas, Thanasis-Michail Karampatsis, Andreas Ktenidis, Evdokia Mailli, Theodore Papamarkou, Yannis Ioannidis

Affiliations: Athena Research Center; National and Kapodistrian University of Athens; National Technical University of Athens

AI summary: This paper proposes a typed tensor language that formalizes the structure of federated learning, giving a formal account of federated learning computations through a shared-state factorization theory and a differentiable fragment.

Abstract

Federated learning and analytics are often described as collections of separate protocols, even when they share the same mathematical form: client-local tensor computation, mergeable aggregation into shared state, and shared-only post-processing. We introduce a typed tensor language that formalizes this structure. The language distinguishes federated tensors, whose records are partitioned across clients along a tracked record axis, from shared tensors, which are available globally. Its semantics are defined by comparison with a virtual global tensor, used only as a reference object. The main result is a shared-state factorization theory. We show that typed one-round programs factor through fixed-dimensional shared state whose size is independent of the number of clients and records, computed from client-local tensor expressions and merged across clients. We also prove a converse representability result; factorizations whose encoders and decoders are expressible in the language are realized by typed one-round programs, and the correspondence extends to iterative programs whose cross-round state is shared. This gives a formal account of the computations in the language that can be expressed as encode, merge, and decode procedures. We then develop a differentiable fragment for learning. If a per-record loss and its per-record gradient are represented by client-local tensor expressions, the global gradient is represented by record-axis summation of the federated gradient tensor. This yields typed iterative programs for server-side gradient descent and shared-linear-algebra second-order updates. The framework characterizes a broad class of federated learning computations whose communication passes through fixed-dimensional shared state.
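
Illustrative sketch (not the paper's language): the encode, merge, and decode pattern with fixed-dimensional shared state can be seen in a federated mean and variance. In the Python below, each client encodes its local records into three numbers, the states are merged by addition, and the result is decoded on the shared side. The function names are assumptions; the final comparison against the concatenated records plays the role of the "virtual global tensor" used only as a reference.

```python
def encode(records):
    """Client-local computation: count, sum, and sum of squares."""
    return (len(records), sum(records), sum(x * x for x in records))


def merge(state_a, state_b):
    """Mergeable aggregation: component-wise addition of shared state."""
    return tuple(a + b for a, b in zip(state_a, state_b))


def decode(state):
    """Shared-only post-processing: mean and population variance."""
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean


def federated_mean_var(clients):
    state = (0, 0.0, 0.0)          # size is independent of clients and records
    for records in clients:
        state = merge(state, encode(records))
    return decode(state)


clients = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
federated = federated_mean_var(clients)
# Reference semantics: the same computation on the virtual global tensor.
reference = decode(encode([x for records in clients for x in records]))
```

The shared state stays at three numbers however many clients or records participate, which matches the fixed-dimensional property the abstract emphasizes.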

2605.21102 2026-05-21 cs.CL cs.AI cs.SE

ACL-Verbatim: hallucination-free question answering for research

Gábor Recski, Szilveszter Tóth, Nadia Verdha, István Boros, Ádám Kovács

Affiliations: TU Wien; KR Labs

AI summary: This work presents ACL-Verbatim, which applies extractive question answering to map user queries to relevant text spans in research papers. It builds a new ground truth dataset and uses it to train and evaluate several extractive models; a 150M-parameter ModernBERT model reaches a word-level F1 of 53.6, ahead of the strongest LLM extractor.

Comments: 13 pages

Abstract

Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
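
Illustrative sketch (not from the paper): the abstract reports word-level F1 for extracted spans without defining it. The Python below uses the common bag-of-words overlap F1 between a predicted span and a gold span as an assumed stand-in; the paper's tokenization and aggregation over examples may differ, and the function name is hypothetical.

```python
from collections import Counter


def word_f1(predicted, gold):
    """Word-overlap F1 between a predicted span and a gold span."""
    pred_words = predicted.lower().split()
    gold_words = gold.lower().split()
    if not pred_words or not gold_words:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_words == gold_words)
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred_words) & Counter(gold_words)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_words)
    recall = overlap / len(gold_words)
    return 2 * precision * recall / (precision + recall)


score = word_f1("the cat sat", "the cat")
```

Here the prediction covers both gold words (recall 1) but adds one extra word (precision 2/3), so over-long extractions are penalized even when they contain the gold span.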

2605.21099 2026-05-21 cs.CV

R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound

R2AoP: 基于产时超声的可靠且鲁棒的进展角估计

Yuanhan Wang, Yifei Chen, Beining Wu, Mingxuan Liu, Xiaotian Hu, Chunbo Jiang, Yijin Li, Changmiao Wang, Feiwei Qin, Qiyuan Tian

发表机构 * Tsinghua University(清华大学) Hangzhou Dianzi University(杭州电子科技大学) Shenzhen Research Institute of Big Data(深圳大数据研究院)

AI总结 本文提出R2AoP框架,通过结构引导的分割和置信度引导的几何建模,实现了稳定的进展角估计,同时引入轻量级几何可靠测试时适应策略以提高在异质采集条件下的性能。

Comments 11 pages, 4 figures, accepted by MICCAI 2026

详情
AI中文摘要

从产时经会阴超声中准确估计进展角(AoP)对于客观评估产程进展至关重要,但该估计仍对成像噪声、边界模糊以及局部分割误差的几何放大高度敏感。我们提出R2AoP,一种可靠且鲁棒的AoP估计框架,整合了结构引导的分割和置信度引导的几何建模,以实现稳定且可重复的测量。一个三分支局部结构增强的主干网络改善了耻骨联合(PS)和胎头(FH)的勾画,而置信度加权轮廓拟合则在AoP计算中显式抑制不可靠边界点的影响。为进一步提高在异质采集条件下的性能,我们引入了一种轻量级、几何可靠的测试时适应策略作为辅助组件,从而无需目标域标注即可实现稳定推理。在多中心基准上的广泛评估表明,与最先进的AoP方法相比,AoP误差和边界指标均得到一致降低。我们的源代码可在https://github.com/baiyou1234/R2AoP获得。

英文摘要

Accurate estimation of the Angle of Progression (AoP) from intrapartum transperineal ultrasound is critical for objective assessment of labor progression, yet remains highly sensitive to imaging noise, boundary ambiguities, and the geometric amplification of local segmentation errors. We propose R2AoP, a reliable and robust AoP estimation framework that integrates structurally informed segmentation and confidence-guided geometric modeling to achieve stable and reproducible measurements. A three-branch local-structure-enhanced backbone improves the delineation of the pubic symphysis (PS) and fetal head (FH), while confidence-weighted contour fitting explicitly suppresses the influence of unreliable boundary points in AoP computation. To further improve performance under heterogeneous acquisition conditions, we introduce a lightweight geometry-reliable test-time adaptation strategy as an auxiliary component, enabling stable inference without target annotations. Extensive evaluations on multi-center benchmarks demonstrate consistent reductions in AoP error and boundary metrics compared with state-of-the-art AoP methods. Our source code is available at https://github.com/baiyou1234/R2AoP.
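To make the idea of confidence weighting concrete, the sketch below estimates the orientation of a set of boundary points with a weighted principal-axis fit, so that low-confidence points contribute little to the result. This is not the fitting routine from the paper, whose details are not given in the abstract; the function name `weighted_axis_angle` and the choice of a straight-axis fit are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def weighted_axis_angle(points: List[Point], conf: List[float]) -> float:
    """Orientation in radians, within (-pi/2, pi/2], of the confidence-weighted principal axis."""
    total = sum(conf)
    mean_x = sum(w * x for (x, _), w in zip(points, conf)) / total
    mean_y = sum(w * y for (_, y), w in zip(points, conf)) / total
    # Weighted second moments around the weighted centroid.
    sxx = sum(w * (x - mean_x) ** 2 for (x, _), w in zip(points, conf))
    syy = sum(w * (y - mean_y) ** 2 for (_, y), w in zip(points, conf))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for (x, y), w in zip(points, conf))
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


boundary = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (1.0, 5.0)]  # last point is an outlier
trusted = weighted_axis_angle(boundary, [1.0, 1.0, 1.0, 0.0])
uniform = weighted_axis_angle(boundary, [1.0, 1.0, 1.0, 1.0])
print(math.degrees(trusted), math.degrees(uniform))
```

With the outlier down-weighted to zero, the recovered axis follows the three collinear points; with uniform weights the single bad point rotates the axis substantially, which is the kind of geometric amplification the abstract refers to.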

2605.21097 2026-05-21 cs.CL

WCXB: A Multi-Type Web Content Extraction Benchmark

WCXB:一个多类型网络内容提取基准

Murrough Foley

发表机构 * Murrough Foley

AI总结 本文提出WCXB基准,包含2008个网页,涵盖七种结构不同的页面类型,通过五阶段流程生成真实标注,评估13种提取系统,发现现有文章-only基准无法发现结构化页面的盲区。

Comments Dataset: github.com/Murrough-Foley/web-content-extraction-benchmark, doi.org/10.5281/zenodo.19316874. Leaderboard: webcontentextraction.org. Preprint also deposited at doi.org/10.5281/zenodo.19664685

详情
AI中文摘要

网络内容提取——从网页中隔离主要内容以排除周围模板内容——是搜索引擎索引、检索增强生成、NLP数据集构建和大语言模型训练的前提。该领域的进展受到现有评估基准的限制,这些基准规模小(100-800页)、仅限于新闻文章或基于超过十年前的网页。我们介绍了网络内容提取基准(WCXB),包含来自1613个域的2008个网页,涵盖七种结构不同的页面类型:文章、论坛、产品、集合、列表、文档和服务页面。该数据集包括1497页的开发集和511页的测试集,具有匹配的页面类型分布。真实标注通过五阶段流程生成:LLM辅助草稿、自动化验证、四轮前沿模型审查、片段和质量验证脚本以及人工审查。我们评估了13种提取系统——11种启发式和2种神经网络——发现尽管顶级系统在文章上(F1=0.93)表现良好,但在结构化页面类型上性能差异显著(F1=0.41-0.84),揭示了现有文章-only基准无法发现的盲区。该数据集以CC-BY-4.0许可证发布,包含HTML源文件、真实标注、页面类型标签和基线结果。

英文摘要

Web content extraction - isolating a page's main content from surrounding boilerplate - is a prerequisite for search indexing, retrieval-augmented generation, NLP dataset construction, and large language model training. Progress in this area has been constrained by the limitations of existing evaluation benchmarks, which are small (100-800 pages), restricted to news articles, or based on web pages from over a decade ago. We introduce the Web Content Extraction Benchmark (WCXB), a dataset of 2,008 web pages from 1,613 domains spanning seven structurally distinct page types: articles, forums, products, collections, listings, documentation, and service pages. The dataset includes a 1,497-page development set and a 511-page held-out test set with matched page type distributions. Ground truth annotations were produced through a five-stage pipeline: LLM-assisted drafting, automated verification, four-pass frontier model review, snippet and quality verification scripts, and human review. We evaluate 13 extraction systems - 11 heuristic and 2 neural - and find that while top systems converge on articles (F1 = 0.93), performance diverges sharply on structured page types (F1 = 0.41-0.84), revealing blind spots invisible to existing article-only benchmarks. The dataset is released under CC-BY-4.0 with HTML source files, ground truth annotations, page type labels, and baseline results.

2605.21094 2026-05-21 cs.LG

UOTIP: Unbalanced Optimal Transport Map for Unpaired Inverse Problems

UOTIP:用于无配对逆问题的不平衡最优传输映射

Donggyu Lee, Taekyung Lee, Jaewoong Choi

发表机构 * Sungkyunkwan University(成均馆大学) IPAI (Interdisciplinary Program in Artificial Intelligence, Seoul National University)(人工智能跨学科项目(Seoul National University)) Department of Mathematical Sciences, Seoul National University(首尔国立大学数学科学系)

AI总结 本文提出了一种基于不平衡最优传输的逆问题求解器UOTIP,通过引入基于似然的成本函数,将重建任务建模为从噪声测量分布到干净信号分布的学习过程,从而在无配对逆问题上实现了最先进的性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

我们研究了无配对图像逆问题,这是一种具有挑战性的设置,其中训练时只有相互独立、未配对的噪声测量集和干净目标信号集可用。我们提出了一种基于不平衡最优传输(UOT)的新型逆问题求解器,称为用于逆问题的不平衡最优传输映射(UOTIP)。我们的方法引入基于似然的成本函数,将重建任务(即从噪声测量预测干净目标信号)建模为学习从噪声测量分布到干净信号分布的UOT映射。通过放松精确的边缘约束,UOT框架为我们的模型带来了关键优势:对多级观测噪声的鲁棒性、对噪声数据集与干净数据集之间类别不平衡的适应性,以及对多种噪声类型场景的泛化能力。此外,我们从理论上证明,引入二次成本项可满足扭转条件(twist condition),从而保证传输映射的存在性和唯一性,即使对于病态逆问题也是如此。我们的实验表明,UOTIP在无配对图像逆问题基准上实现了最先进的性能,涵盖线性和非线性逆问题。

英文摘要

We investigate unpaired image inverse problems, a challenging setting where only independent, non-paired sets of noisy measurements and clean target signals are available for training. We propose a novel inverse problem solver based on Unbalanced Optimal Transport, called Unbalanced Optimal Transport Map for Inverse Problems (UOTIP). Our method formulates the reconstruction task, predicting clean target signals from noisy measurements, as learning a UOT Map from noisy measurement distribution to clean signal distribution by incorporating a likelihood-based cost function. By relaxing the exact marginal constraint, the UOT framework provides key advantages to our model: robustness to multi-level observation noise, adaptability to class imbalance between noisy and clean datasets, and generalizability to diverse noise-type scenarios. Furthermore, we theoretically demonstrate that incorporating a quadratic cost term ensures the existence and uniqueness of the transport map by satisfying the twist condition, even for ill-posed inverse problems. Our experiments demonstrate that UOTIP achieves state-of-the-art performance on unpaired image inverse problem benchmarks, across linear and nonlinear inverse problems.
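For orientation, the generic unbalanced optimal transport problem that the abstract builds on can be written as follows. This is the standard textbook form, not the paper's exact objective: the symbols $\mu$ (noisy measurement distribution), $\nu$ (clean signal distribution), the divergences $D_{\Psi_1}, D_{\Psi_2}$, and the Gaussian-noise example for the likelihood term are assumptions used only to illustrate the structure.

```latex
% Unbalanced OT: the hard marginal constraints of classical OT are replaced
% by divergence penalties on the marginals \pi_0 and \pi_1 of the coupling \pi.
C_{\mathrm{UOT}}(\mu, \nu)
  = \inf_{\pi \in \mathcal{M}_{+}(\mathcal{Y} \times \mathcal{X})}
    \int c(y, x)\, \mathrm{d}\pi(y, x)
    + D_{\Psi_1}(\pi_0 \,\|\, \mu)
    + D_{\Psi_2}(\pi_1 \,\|\, \nu)

% Example of a likelihood-based cost term under an assumed Gaussian noise
% model y = A(x) + n, with n \sim \mathcal{N}(0, \sigma^2 I):
-\log p(y \mid x) = \frac{1}{2\sigma^{2}} \, \lVert y - A(x) \rVert_2^{2} + \text{const}
```

Classical (balanced) optimal transport is recovered when both penalties force $\pi_0 = \mu$ and $\pi_1 = \nu$ exactly. Relaxing them lets the coupling discard or reweight mass, which is consistent with the robustness to class imbalance and mixed noise levels that the abstract attributes to the UOT formulation.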

2605.21090 2026-05-21 cs.CV

TextSculptor: Training and Benchmarking Scene Text Editing

TextSculptor: 训练和评估场景文本编辑

Yiheng Lin, Siyu Jiao, Xiaohan Lan, Wei Zhou, Qi She, Fei Yu, Heyun Chen, Zhengwei Wang, Jinghuan Chen, Moran Li, Yingchen Yu, Zijian Feng, Yao Zhao, Yunchao Wei, Yujie Zhong

发表机构 * Beijing Jiaotong University(北京交通大学) Bytedance(字节跳动)

AI总结 本文提出TextSculptor框架,通过构建大规模数据集和基准测试,解决场景文本编辑中高质量训练数据稀缺和缺乏标准化评估的问题,提升开源模型性能。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)和基于扩散的生成模型的进展显著提升了基于提示的图像编辑能力。然而,场景文本编辑仍具挑战性,因为模型需要精确修改文本内容,同时保持视觉真实性和非目标区域的完整性。当前开源模型仍落后于专有系统,主要由于高质量训练数据稀缺和缺乏针对文本编辑的标准化基准。为解决这些问题,我们提出了TextSculptor,一个全面的场景文本编辑数据构建和评估框架。我们首先开发了一个自动化数据构建管道,结合文本感知图像合成、程序化文本渲染和合成。基于此管道,我们构建了TextSculpt-Data,一个包含320万训练样本的大规模数据集,包括120万经过OCR验证的文本到图像样本和200万配对的文本编辑样本,具有自然对齐的源-目标图像和强背景一致性。我们进一步引入了TextSculpt-Bench,涵盖四个基本文本编辑任务:文本添加、文本替换、文本删除和混合编辑。为了支持可靠的评估,我们设计了一个定制协议,通过OCR文本对齐、多模态判断和背景区域相似性测量文本准确性、视觉质量和背景保持。广泛的实验表明,TextSculptor提升了开源文本编辑性能,缩小了与专有模型之间的差距。数据和基准可在https://github.com/linyiheng123/TextSculptor获取。

英文摘要

Recent advances in Multimodal Large Language Models (MLLMs) and diffusion-based generative models have substantially improved prompt-driven image editing. However, scene text editing remains challenging, as it requires models to precisely modify textual content while preserving visual realism and non-target regions. Current open-source models still lag behind proprietary systems, largely due to the scarcity of high-quality training data and the lack of standardized benchmarks tailored to text editing. To address these challenges, we present TextSculptor, a comprehensive framework for data construction and evaluation of scene text editing. We first develop an automated data construction pipeline that combines text-aware image synthesis with programmatic text rendering and compositing. Based on this pipeline, we build TextSculpt-Data, a large-scale dataset containing 3.2M training samples, including 1.2M OCR-verified text-to-image samples and 2M paired text editing samples with naturally aligned source-target images and strong background consistency. We further introduce TextSculpt-Bench, a benchmark covering four fundamental text editing tasks: text addition, text replacement, text removal, and hybrid editing. To support reliable evaluation, we design a tailored protocol that measures text accuracy, visual quality, and background preservation through OCR-based text alignment, multimodal judgment, and background-region similarity. Extensive experiments show that TextSculptor improves open-source text editing performance and narrows the gap to proprietary models. The data and benchmark are available at https://github.com/linyiheng123/TextSculptor.

2605.21088 2026-05-21 cs.LG

Reviving Error Correction in Modern Deep Time-Series Forecasting

在现代深度时间序列预测中复兴误差校正

Minh Hoang Nguyen, Dai Do, Huu Hiep Nguyen, Dung Nguyen, Kien Do, Hung Le

发表机构 * Deakin Applied AI Initiative, Deakin University, Australia(迪肯应用人工智能倡议,迪肯大学,澳大利亚)

AI总结 本文研究了深度时间序列预测中的误差累积问题,提出了一种通用误差校正器,通过分解趋势和季节性成分来提升预测的准确性和鲁棒性。

Comments 27 pages

详情
AI中文摘要

现代深度学习模型在时间序列预测中取得了显著成功。然而,由于自回归推理中的误差累积,其在长期预测中的性能会下降。尽管经典的误差校正机制(ECMs)长期以来被用于统计方法,但它们在深度学习模型中的应用仍然有限或无效。在本文中,我们重新审视了深度时间序列预测中的误差累积问题,并探讨了ECMs在此新背景中的作用和必要性。我们提出了一种简单、架构无关的误差校正模型,可以与任何现有的预测器集成,而无需重新训练。通过显式地将预测分解为趋势和季节性成分,并分别训练校正器来调整每个成分,我们引入了具有季节-趋势分解的通用误差校正器(UEC-STD),在4种骨干网络和10个数据集上显著提高了校正精度和鲁棒性。我们的发现提供了一种实用工具来增强预测,同时为减轻深度时间序列模型中的自回归误差提供了新的见解。代码可在https://github.com/DA2I2-SLM/UEC-STD上获得。

英文摘要

Modern deep-learning models have achieved remarkable success in time-series forecasting. Yet, their performance degrades in long-term prediction due to error accumulation in autoregressive inference, where predictions are recursively used as inputs. While classical error correction mechanisms (ECMs) have long been used in statistical methods, their applicability to deep learning models remains limited or ineffective. In this work, we revisit the error accumulation problem in deep time-series forecasting and investigate the role and necessity of ECMs in this new context. We propose a simple, architecture-agnostic error correction model that can be integrated with any existing forecaster without requiring retraining. By explicitly decomposing predictions into trend and seasonal components and training the corrector to adjust each separately, we introduce the Universal Error Corrector with Seasonal-Trend Decomposition (UEC-STD), which significantly improves correction accuracy and robustness across 4 backbones and 10 datasets. Our findings provide a practical tool for enhancing forecasts while offering new insights into mitigating autoregressive errors in deep time-series models. Code is available at https://github.com/DA2I2-SLM/UEC-STD.
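The decompose-then-correct idea can be sketched as below. This is an illustrative outline rather than UEC-STD itself: the moving-average decomposition, the function names `decompose` and `correct`, and the use of plain callables in place of trained correctors are all assumptions, since the abstract does not specify these details.

```python
from typing import Callable, List, Tuple

Corrector = Callable[[List[float]], List[float]]


def decompose(series: List[float], window: int) -> Tuple[List[float], List[float]]:
    """Split a series into a centered moving-average trend and the remaining seasonal part."""
    half = window // 2
    trend = []
    for t in range(len(series)):
        lo, hi = max(0, t - half), min(len(series), t + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    seasonal = [value - level for value, level in zip(series, trend)]
    return trend, seasonal


def correct(forecast: List[float], window: int,
            trend_corrector: Corrector, seasonal_corrector: Corrector) -> List[float]:
    """Adjust each component separately, then recombine; the forecaster itself is untouched."""
    trend, seasonal = decompose(forecast, window)
    fixed_trend = trend_corrector(trend)
    fixed_seasonal = seasonal_corrector(seasonal)
    return [a + b for a, b in zip(fixed_trend, fixed_seasonal)]


# Hypothetical correctors: remove a constant drift from the trend, leave seasonality alone.
forecast = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
adjusted = correct(forecast, 3, lambda xs: [x - 0.5 for x in xs], lambda xs: list(xs))
print(adjusted)
```

Because the correction wraps the forecaster's output, the backbone model needs no retraining, which matches the architecture-agnostic claim in the abstract.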

2605.21086 2026-05-21 cs.CL

LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

LoCar: 通过细粒度社会语言学控制评估车载助手的本地化

Seogyeong Jeong, Kiwoong Park, Seyoung Song, Eunsu Kim, Ken E. Friedl, Jaeho Kim, Alice Oh

发表机构 * KAIST(韩国科学技术院) BMW Group(宝马集团)

AI总结 本文提出了一种新的车载助手评估框架,专注于韩语本地化,揭示了当前LLM在细粒度韩语敬语控制方面的不稳定性以及策略性对话指标上的表现不足,强调了汽车AI需要向精确语言定制和可靠安全交互管理发展。

Comments To appear in ACL 2026 Industry Track

详情
AI中文摘要

尽管大型语言模型(LLMs)越来越多地集成到车载对话系统中,但由于缺乏针对实际部署需求定制的领域特定评估标准,确定最优模型仍具挑战性。在本文中,我们提出了一种新的车载助手评估框架,特别关注韩语本地化。我们的实证分析揭示了模型行为中的显著模式。首先,细粒度的韩语敬语控制在当前LLM中仍不稳定,这表明在本地化设置中必须显式评估话阶(speech level)的精确实现。其次,模型在澄清和主动性等策略性对话指标上表现较弱。我们的分析表明,这源于这些任务固有的主观复杂性,对此我们的框架采取保守的评估立场以优先保证可靠性。总体而言,我们的发现强调,汽车AI必须超越通用能力,向精确的语言定制和可靠、以安全为导向的交互管理发展。

英文摘要

While Large Language Models (LLMs) are increasingly integrated into in-vehicle conversational systems, identifying the optimal model remains challenging due to the lack of domain-specific evaluation standards tailored to real-world deployment requirements. In this paper, we propose a novel evaluation framework for in-vehicle assistants, with a particular focus on Korean-language localization. Our empirical analysis reveals notable patterns in model behavior. First, fine-grained Korean honorific control remains unstable in current LLMs, indicating that precise speech-level realization must be explicitly evaluated in localization settings. Second, models exhibit weaker performance in strategic conversational metrics like clarification and proactivity. Our analysis suggests this stems from the inherent subjective complexity of these tasks, where our framework adopts a conservative evaluation stance to prioritize reliability. Together, our findings underscore that automotive AI must move beyond general competence toward precise linguistic tailoring and reliable, safety-oriented interaction management.

2605.21082 2026-05-21 cs.AI

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

AutoRPA: 通过LLM驱动、基于交互的代码合成实现高效的GUI自动化

Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

发表机构 * Zhejiang Key Laboratory of Space Information Sensing and Transmission(浙江空间信息感知与传输重点实验室) School of Computer Science, Hangzhou Dianzi University(杭州电子科技大学计算机科学学院)

AI总结 本文提出AutoRPA框架,通过将ReAct风格代理的决策逻辑自动转化为鲁棒的RPA功能,提高GUI自动化效率和可重用性,同时减少82%到96%的token使用量。

Comments Accepted in ICML 2026

详情
AI中文摘要

基于大型语言模型(LLM)的代理在多步骤的图形用户界面(GUI)交互中表现出色。尽管大多数研究集中在提升单任务性能,但实际场景中往往涉及重复的GUI任务,而频繁调用LLM推理(即ReAct范式)效率低下。在LLM之前,传统的机器人流程自动化(RPA)提供运行时效率,但需要大量手动努力来开发和维护。为弥合这一差距,我们提出AutoRPA框架,该框架能够自动将ReAct风格代理的决策逻辑转化为鲁棒的RPA功能。AutoRPA引入了两个核心创新:(1)一个翻译-构建流水线,其中翻译代理将硬编码的ReAct动作转换为软编码的流程,构建代理通过多轨迹检索增强生成合成鲁棒的RPA功能;(2)在代码验证期间的混合修复策略,结合RPA执行与基于ReAct的回退机制进行迭代优化。在多个GUI环境中的实验表明,由AutoRPA生成的RPA功能能够成功解决类似任务,同时减少82%到96%的token使用量,显著提高运行时效率和可重用性。

英文摘要

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve repetitive GUI tasks for which invoking LLM reasoning repeatedly, i.e., the ReAct paradigm, is inefficient. Prior to LLMs, traditional Robotic Process Automation (RPA) offers runtime efficiency but demands significant manual effort to develop and maintain. To bridge this gap, we propose AutoRPA, a framework that automatically distills the decision logic of ReAct-style agents into robust RPA functions. AutoRPA introduces two core innovations: (1) A translator-builder pipeline, where a translator agent converts hard-coded ReAct actions into soft-coded procedures, and a builder agent synthesizes robust RPA functions via retrieval-augmented generation over multiple trajectories; (2) A hybrid repair strategy during code verification, combining RPA execution with ReAct-based fallback for iterative refinement. Experiments across multiple GUI environments demonstrate that RPA functions generated by AutoRPA successfully solve similar tasks while reducing token usage by 82% to 96%, significantly improving runtime efficiency and reusability.

2605.21081 2026-05-21 cs.SD cs.LG

Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model

音乐注意力转换器:使用音乐特定的注意力模型进行音乐生成

Shinnosuke Taksuka, Hideo Mukai

发表机构 * Department of Computer Science, School of Science and Technology, Meiji University(计算机科学系,理工学部,明治大学)

AI总结 本文提出了一种音乐特定的注意力模型,通过整合元信息来提升音乐生成的质量,核心方法是将音乐结构和元数据结合,主要贡献是提高了生成音乐的连贯性和多样性。

Comments 32 pages, 13 figures

详情
AI中文摘要

本研究旨在通过引入元信息来提升使用Transformer进行音乐生成的质量。尽管基于Transformer的方法在捕捉音乐作品中的长期依赖性方面有效,但它们生成的音乐常出现重复或音符重复的问题,导致不自然的旋律。为了解决这些限制,我们提出了音乐注意力机制,该机制将元信息如小节号、调性、节拍等整合到注意力过程中。音乐注意力显式利用音乐的结构属性及其相关元数据,使Transformer的注意力机制能够更有效地运作,从而提高生成输出的质量。在我们的框架中,每个音乐音符被表示为五个事件(音高、小节号、起始时间、持续时间和力度)以及三个元数据元素的组合。然后将注意力机制修改为反映这些八个特征之间的相关性,使模型能够更好地捕捉音乐编排的内在特性。实验结果表明,整合音乐注意力的模型在音乐连贯性、变化性和整体质量方面优于先前的方法,如全注意力和步进注意力。值得注意的是,它显著减少了重复并增强了模型生成多样化、和谐一致的旋律的能力。音乐注意力因此在AI驱动的音乐生成中代表了重要的进展,有助于创建更自然和富有表现力的音乐作品。

英文摘要

This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events (pitch, bar number, onset, duration, and velocity) in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.

2605.21076 2026-05-21 cs.CL

GradeLegal: Automated Grading for German Legal Cases

GradeLegal: 自动化评分德国法律案例

Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic

发表机构 * University of Passau(帕绍大学) Deggendorf Institute of Technology(代根多夫理工学院)

AI总结 本研究探讨了大型语言模型能否用于自动化评分德国刑事和公法案例解决方案,通过系统评估27个专有和开源LLM,发现基于样本解决方案和评分标准的提示策略能有效模拟专家评分,尤其在公法领域达到0.91的QWK值,而在刑事法律领域仅为0.60,表明刑事法律评分任务更难。此外,集成方法能进一步提高评分一致性,并为更强的闭源单模型提供替代方案。

详情
AI中文摘要

德国法律考试答卷的评分面临答卷数量不断增长和合格评分员短缺的问题,导致反馈延迟并形成瓶颈。同时,这是一项高风险的专家任务,因为国家考试成绩对德国的职业发展影响重大。尽管具有实际意义,文献中仍缺乏关于法律考试有效评分方法的系统研究。为填补这一空白,我们研究了大型语言模型(LLMs)能否支持对德国刑法和公法案例解答的自动评分,从而实现可扩展的反馈和学生自测。我们系统评估了27个专有和开源LLM,并对逐步加入任务相关信息(如参考答案和评分细则)的提示策略进行了基准测试。以二次加权kappa(QWK)衡量,面向推理的LLM在获得参考答案和评分细则时,能够在公法中接近专家评分(最高0.91),而在刑法中为0.60,表明刑法的评分任务更难。除单模型评分外,集成方法相比其最佳成员可将一致性最多提高0.15,并可作为更强的闭源单模型的替代方案。此外,我们的发现表明,有效的提示设计和模型选择是实现可靠的基于LLM的法律考试评分所必需的。

英文摘要

Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
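Quadratic weighted kappa, the agreement metric named in the abstract, penalizes disagreements by the squared distance between grades and corrects for chance agreement. The sketch below follows the standard definition; the function name and the assumption that grades are integers in `0..num_grades-1` are illustrative, and the paper's own evaluation code may differ.

```python
from typing import List


def quadratic_weighted_kappa(rater_a: List[int], rater_b: List[int], num_grades: int) -> float:
    """QWK between two integer gradings on the scale 0..num_grades-1."""
    n = len(rater_a)
    observed = [[0.0] * num_grades for _ in range(num_grades)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_grades)) for j in range(num_grades)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_grades):
        for j in range(num_grades):
            weight = (i - j) ** 2 / (num_grades - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count if the two graders were statistically independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    # Undefined (division by zero) when both graders give one identical grade throughout.
    return 1.0 - weighted_observed / weighted_expected


expert = [0, 1, 2, 2, 1]
model = [0, 1, 2, 1, 1]
print(quadratic_weighted_kappa(expert, model, num_grades=3))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported gap between 0.91 and 0.60 reflects a large difference in how closely model grades track expert grades.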

2605.21075 2026-05-21 cs.CV cs.LG

SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining

SpectralEarth-FM: 将高光谱图像引入多模态地球观测预训练

Nassim Ait Ali Braham, Aaron Banze, Conrad M. Albrecht, Julien Mairal, Jocelyn Chanussot, Xiao Xiang Zhu

发表机构 * Chair of Data Science in Earth Observation(地球观测数据科学教席) Technical University of Munich(慕尼黑技术大学) Remote Sensing Technology Institute(遥感技术研究所) German Aerospace Center (DLR)(德国航空航天中心) Department of Aerospace Engineering(航空航天工程系) University of the Bundeswehr Munich(联邦国防军慕尼黑大学) LEAP Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Univ. Grenoble Alpes(格勒诺布尔阿尔卑斯大学) Inria(法国国家信息与自动化技术研究院) CNRS(法国国家科学研究中心) Grenoble INP(格勒诺布尔INP) LJK

AI总结 本文提出SpectralEarth-FM,一种用于多传感器地球观测输入的分层Transformer,旨在联合处理高光谱图像与通道数较少的观测。通过构建SpectralEarth-MM数据集,采用JEPA风格的目标进行预训练,在高光谱下游任务和标准EO基准上取得了最先进的结果。

详情
AI中文摘要

地球观测(EO)基础模型(FMs)越来越多地使用多传感器数据进行训练,涵盖多光谱图像(MSI)、合成孔径雷达(SAR)和衍生的地理空间图层,但高光谱图像(HSI)的代表性仍然不足。反之,现有的高光谱FM仅在HSI上训练,尚未探索HSI与共定位EO传感器的联合预训练和融合。我们引入SpectralEarth-FM,一种面向具有异构光谱维度的多传感器EO输入的分层Transformer。该架构结合了面向高光谱输入的光谱标记化、传感器特定编码器、跨传感器融合模块和共享分层编码器,能够联合处理HSI和通道数较少的观测。为了预训练SpectralEarth-FM,我们构建了SpectralEarth-MM数据集,该数据集将来自三个星载传感器(EnMAP、EMIT、DESIS)的HSI与Sentinel-2、Landsat-8/9光学图像、Landsat地表温度(LST)和Sentinel-1 SAR在共同的地理覆盖范围上进行共定位。该数据集包含约200万个全球分布的地点、2500万个带地理参考的图像块,以及超过40TB的数据。预训练使用联合嵌入预测架构(JEPA)风格的目标,对同一地点的全局视图与单传感器局部视图之间的表示进行匹配。我们遵循PANGAEA协议,在高光谱下游任务和标准EO基准上评估SpectralEarth-FM,在两种评估设置中均取得了最先进的结果。

英文摘要

Earth observation (EO) foundation models (FMs) are increasingly trained on multisensor data, spanning multispectral imagery (MSI), synthetic aperture radar (SAR), and derived geospatial layers, but hyperspectral imagery (HSI) remains underrepresented. Conversely, existing hyperspectral FMs are trained on HSI alone, leaving joint pretraining and fusion of HSI with co-located EO sensors unexplored. We introduce SpectralEarth-FM, a hierarchical transformer for multisensor EO input with heterogeneous spectral dimensionality. The architecture combines spectral tokenization for hyperspectral inputs, sensor-specific encoders, a cross-sensor fusion module, and a shared hierarchical encoder, enabling joint processing of HSI and lower-channel observations. To pretrain SpectralEarth-FM, we curate SpectralEarth-MM, a dataset that co-locates HSI from three spaceborne sensors (EnMAP, EMIT, DESIS) with Sentinel-2, Landsat-8/9 optical imagery, Landsat land surface temperature (LST), and Sentinel-1 SAR, over common geographic footprints. It comprises approximately 2M globally distributed locations, 25M georeferenced patches, and over 40TB of data. Pretraining uses a Joint-Embedding Predictive Architecture (JEPA)-style objective that matches representations between global views and single-sensor local views from the same location. We evaluate SpectralEarth-FM on hyperspectral downstream tasks and standard EO benchmarks following the PANGAEA protocol, achieving state-of-the-art results across both evaluation settings.

2605.21072 2026-05-21 cs.CV

Q-ARVD: Quantizing Autoregressive Video Diffusion Models

Q-ARVD: 对自回归视频扩散模型进行量化

Siao Tang, Xinyin Ma, Gongfan Fang, Xingyi Yang, Xinchao Wang

发表机构 * National University of Singapore(新加坡国立大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文针对自回归视频扩散模型(ARVD)的量化问题,提出了一种新的框架Q-ARVD,解决了逐帧量化敏感度不平衡和权重中离群值模式异质的问题,从而提高了模型效率。

Comments Code: https://github.com/tsa18/Q-ARVD

详情
AI中文摘要

自回归视频扩散模型(ARVD)已成为流式视频生成的一种有前景的架构,为实时交互式视频生成和世界建模铺平了道路。尽管潜力巨大,ARVD高昂的推理成本仍是实际部署的主要障碍,这使模型量化成为提高效率的自然方向。然而,ARVD的量化仍鲜有研究。我们的实证分析表明,将现有为标准扩散Transformer开发的量化方案直接应用于ARVD会导致次优性能,并揭示出与双向扩散模型不同的量化行为。在本文中,我们指出了量化ARVD的两个关键挑战:(C1)高度不平衡的逐帧量化敏感度。自回归生成过程中的误差累积会导致各帧的量化敏感度严重偏斜,呈现类似指数衰减的模式。(C2)权重中显著且异质的离群值模式。权重分布表现出明显的离群通道,其模式随层类型和块深度变化很大。为了解决这些问题,我们提出了Q-ARVD,一种用于精确ARVD量化的新型框架。(S1)针对高度不平衡的逐帧敏感度,Q-ARVD在量化目标中引入了感知最终质量的帧加权机制。(S2)为防止异质离群值损害性能,Q-ARVD引入了离群值感知的自适应双尺度量化,可自动检测任意层中离群通道的存在与数量,并将其隔离以保护正常通道。大量实验证明了Q-ARVD的优越性。

英文摘要

Autoregressive video diffusion models (ARVDs) have emerged as a promising architecture for streaming video generation, paving the way for real-time interactive video generation and world modeling. Despite their potential, the substantial inference cost of ARVDs remains a major obstacle to practical deployment, making model quantization a natural direction for improving efficiency. However, quantization for ARVDs remains largely unexplored. Our empirical analysis shows that directly applying existing quantization schemes developed for standard diffusion transformers to ARVDs leads to suboptimal performance, revealing quantization behaviors that differ from those observed in bidirectional diffusion models. In this paper, we identify two critical challenges in quantizing ARVDs: (C1) Highly unbalanced frame-wise quantization sensitivity. Error accumulation during autoregressive generation can induce severely skewed quantization sensitivity across frames, following an exponential-like decay pattern. (C2) Prominent and heterogeneous outlier patterns in weights. Weight distributions exhibit pronounced outlier channels, whose patterns vary substantially across layer types and block depths. To address these issues, we propose Q-ARVD, a novel framework for accurate ARVD quantization. (S1) To tackle the highly unbalanced frame-wise sensitivity, Q-ARVD incorporates a final-quality aware frame-weighting mechanism into the quantization objective. (S2) To prevent heterogeneous outliers from degrading performance, Q-ARVD introduces an outlier-aware adaptive dual-scale quantization, which automatically detects the presence and quantity of outlier channels for an arbitrary layer, and isolates them to protect normal channels. Extensive experiments demonstrate the superiority of Q-ARVD.
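The motivation for isolating outlier channels (S2) can be seen in a small sketch: if one channel has a much larger range, a single shared scale wastes most of the integer grid on it and crushes the remaining channels. The code below is only an illustration of that idea. The detection rule (a channel is an outlier when its peak exceeds `ratio` times the median peak), the function names, and the 4-bit symmetric grid are assumptions, not the procedure from the paper.

```python
from statistics import median
from typing import List, Tuple


def fake_quantize(values: List[float], scale: float, bits: int) -> List[float]:
    """Round to a symmetric signed integer grid, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def dual_scale_quantize(weight: List[List[float]], ratio: float = 4.0,
                        bits: int = 4) -> Tuple[List[List[float]], List[int]]:
    """Quantize normal and outlier channels with two separate scales."""
    qmax = 2 ** (bits - 1) - 1
    peaks = [max(abs(v) for v in channel) for channel in weight]
    threshold = ratio * median(peaks)  # assumed detection rule, see lead-in
    outliers = [i for i, peak in enumerate(peaks) if peak > threshold]
    normal_peak = max((p for i, p in enumerate(peaks) if i not in outliers), default=0.0)
    outlier_peak = max((peaks[i] for i in outliers), default=0.0)
    normal_scale = normal_peak / qmax or 1.0    # fall back to 1.0 for all-zero channels
    outlier_scale = outlier_peak / qmax or 1.0
    dequantized = [
        fake_quantize(channel, outlier_scale if i in outliers else normal_scale, bits)
        for i, channel in enumerate(weight)
    ]
    return dequantized, outliers


layer = [[0.1, -0.2], [0.3, 0.1], [8.0, -4.0]]  # third channel is an outlier
restored, outlier_ids = dual_scale_quantize(layer)
print(outlier_ids, restored)
```

With a single scale derived from the value 8.0, the entry 0.1 would round to zero on a 4-bit grid; giving the outlier channel its own scale keeps the small channels within half a quantization step of their original values.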

2605.21070 2026-05-21 cs.LG

Towards Understanding Self-Pretraining for Sequence Classification

迈向对序列分类中自预训练的理解

Omar Coser, Loredana Zollo, Paolo Soda, Antonio Orvieto

发表机构 * Unit of Artificial Intelligence & Computer Systems, Università Campus Bio-Medico di Roma(人工智能与计算机系统单位,罗马生物医学大学) Unit of Advanced Robotics and Human-Centered Technologies, Università Campus Bio-Medico di Roma(先进机器人与以人为本技术单位,罗马生物医学大学) Department of Diagnostics and Intervention, Radiation Physics, Biomedical Engineering, Umeå University(诊断与介入系,辐射物理,生物医学工程,于默奥大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Tübingen AI Center(图宾根人工智能中心)

AI总结 本文通过复制和系统消融Amos等人的研究,揭示了自预训练(SPT)在序列分类中提升性能的关键因素,发现标签监督在学习有用的查询-键注意力模式方面存在瓶颈,并通过简化理论框架证明了自预训练通过学习接近性交互来提升性能。

Comments v1: Preliminary, extension of the version accepted at ICML 2025 Workshop MOSS

详情
AI中文摘要

Amos等人(2024)表明,在不使用外部数据或数据增强的情况下,先以掩码标记预测目标进行预训练,可以显著提高Transformer模型在序列分类中的准确性,这一过程称为自预训练(SPT)。尽管Amos等人(2024)的主要目标是展示Transformer能够在Long-Range Arena(LRA)上取得强劲性能,但他们的流程引出了更根本的问题:SPT如何引导优化得到更好的解?为什么标准监督训练在Transformer中可能失败?为了更好地理解这一点,我们复现并系统消融了Amos等人(2024)的发现。我们的消融实验表明,在所研究的设置中,核心瓶颈并不只是深度或泛化,而是标签监督从随机初始化出发学习有用的查询-键注意力模式的能力。在一个最小化设置中,我们发现学习邻近性交互(即将绝对位置编码转化为偏向邻近位置的注意力分数)是SPT带来改进的关键来源。最后,在一个简化的理论设置中,我们证明标签监督可能对某些注意力分数方向在局部上是盲的,而这些方向可以通过掩码重建检测到。

英文摘要

Amos et al. (2024) showed that the accuracy of Transformer models in sequence classification can be significantly improved by first pretraining with a masked token prediction objective without external data or augmentation, a procedure referred to as self-pretraining (SPT). While the primary objective of Amos et al. (2024) was to showcase that Transformers can achieve strong performance on the Long-Range Arena (LRA), their pipeline raises more fundamental questions: How does SPT drive optimization to better solutions? Why can standard supervised training fail in Transformers? To better understand this, we replicate and systematically ablate the findings of Amos et al. (2024). Our ablations suggest that a central bottleneck in the studied settings is not depth or generalization alone, but the ability of label supervision to learn useful query-key Attention patterns from random initialization. With a minimal setup, we identify learning proximity interactions - turning absolute positional encodings into proximity-biased Attention scores - as a key source of the improvements brought by SPT. Finally, in a simplified theoretical setup, we show that label supervision can be locally blind to certain Attention-score directions that are instead detectable through masked reconstruction.

2605.21066 2026-05-21 cs.LG

Robust Personalized Recommendation under Hidden Confounding in MNAR

在MNAR中具有隐藏混杂因素的鲁棒个性化推荐

Zongyu Li, Wanting Su, Tianyu Xia

发表机构 * Guangdong University of Technology(广东工业大学) Chinese Academy of Sciences(中国科学院) Peking University(北京大学)

AI总结 本文提出了一种新的框架,通过估计用户-项目层面的敏感度界限,缓解了全局敏感度界限中固有的同质性假设,从而在存在隐藏混杂因素的情况下实现更鲁棒和准确的个性化推荐。

详情
AI中文摘要

推荐系统通常依赖于观察到的用户-项目交互数据,这些数据由于用户对项目的有选择性交互而容易产生选择偏差。逆概率加权和双重稳健估计器在观察到的混杂因素下有效缓解了选择偏差,但在存在隐藏混杂因素的情况下不可靠。现有的方法依赖于随机对照试验(RCTs)或全局敏感度界限,在实践中受到限制:RCTs需要昂贵的实验数据,而全局敏感度界限假定通过敏感性分析,未测量的混杂因素对倾向性的影响是均匀有界的,从而忽视了用户-项目交互中的异质性。为克服这一限制,我们提出了一种新的框架,该框架估计用户-项目层面的敏感度界限,从而显著放宽了全局敏感度界限中固有的同质性假设,称为个性化未观察混杂因素意识交互去混杂(PUID)。为确保鲁棒性和预测准确性,我们进一步开发了对抗优化策略,并提出了一个基准引导的变体(BPUID),该变体结合了预训练模型作为稳定参考。在三个真实世界数据集上的广泛实验表明,我们的方法在存在隐藏混杂因素的情况下显著优于全局方法,且不需要RCT数据。

英文摘要

Recommender systems often rely on observational user--item interaction data, which is prone to selection bias due to users' selective interactions with items. Inverse propensity weighting and doubly robust estimators effectively mitigate selection bias under observed confounding, but are unreliable in the presence of hidden confounders. Existing approaches relying on randomized controlled trials (RCTs) or global sensitivity bounds are constrained in practice: RCTs demand costly experimental data, while global sensitivity bounds presume a uniformly bounded effect of unmeasured confounders on propensities through sensitivity analysis, thereby neglecting heterogeneity across user--item interactions. To overcome this limitation, we propose a novel framework, which estimates user--item level sensitivity bounds, thereby substantially relaxing the homogeneity assumption inherent in global sensitivity bounds named Personalized Unobserved-Confounding-aware Interaction Deconfounder (PUID). To ensure both robustness and predictive accuracy, we further develop an adversarial optimization strategy and propose a benchmark-guided variant (BPUID) that incorporates pre-trained models as stabilizing references. Extensive experiments on three real-world datasets demonstrate that our approach significantly outperforms global methods under hidden confounding, without requiring RCT data.
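As background for the sensitivity bounds discussed in this abstract, the inverse propensity estimator and a marginal-sensitivity-style bound can be written as below. These are the forms commonly used in the debiased recommendation and sensitivity analysis literature, not formulas quoted from the paper: the symbols $o_{u,i}$ (observation indicator), $e_{u,i}$ (prediction error), $\hat{p}_{u,i}$ (nominal propensity from observed features), $\tilde{p}_{u,i}$ (true propensity under hidden confounding), and $\Gamma$ are assumptions for illustration.

```latex
% Inverse propensity weighted estimate of the prediction loss over all user-item pairs
\hat{\mathcal{L}}_{\mathrm{IPS}}
  = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}}
    \frac{o_{u,i}\, e_{u,i}}{\hat{p}_{u,i}}

% Global sensitivity bound: one constant \Gamma \ge 1 limits the odds ratio
% between true and nominal propensities for every pair
\frac{1}{\Gamma}
  \le \frac{\tilde{p}_{u,i}\,(1 - \hat{p}_{u,i})}{\hat{p}_{u,i}\,(1 - \tilde{p}_{u,i})}
  \le \Gamma

% Personalized variant: the bound is allowed to differ per user-item pair
\frac{1}{\Gamma_{u,i}}
  \le \frac{\tilde{p}_{u,i}\,(1 - \hat{p}_{u,i})}{\hat{p}_{u,i}\,(1 - \tilde{p}_{u,i})}
  \le \Gamma_{u,i}, \qquad \Gamma_{u,i} \ge 1
```

A single global $\Gamma$ must be large enough for the most confounded pair, which makes the resulting worst-case weights loose everywhere else; letting the bound vary per pair is one way to read the relaxation of the homogeneity assumption described above.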

2605.21063 2026-05-21 cs.CL

APM: Evaluating Style Personalization in LLMs with Arbitrary Preference Mappings

APM:通过任意偏好映射评估大语言模型中的风格个性化

Philipp Spohn, Leander Girrbach, Zeynep Akata

发表机构 * Technical University of Munich, Helmholtz Munich(慕尼黑技术大学,亥姆霍兹慕尼黑) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 本研究提出APM基准,通过隐藏的随机偏好映射评估大语言模型的风格个性化能力,发现路由方法最为可靠,RAG仅在更强的基础模型上有所提升,而软提示优化相比非个性化基线没有显著提升。

详情
AI中文摘要

典型的LLM响应往往遵循默认风格,尽管用户对语气、详尽程度和正式程度有不同偏好,但这些偏好并未在提示中明确表达。评估个性化方法是否能适应这些隐式偏好具有挑战性,因为用户通常提供提示而非参考响应,风格偏好无法事实验证,且无参考的LLM评判可能将个性化与一般响应质量混淆。为解决这些挑战,我们引入了任意偏好映射(APM)基准,通过隐式随机映射C将用户属性(如热情)与响应原则(如说服力)解耦。由于C不包含语义内容且在每次运行中重新采样,模型无法利用刻板印象关联,必须从对话历史中推断偏好。使用这种无偏的评估方法,我们适配了检索增强、提示优化和路由个性化方法,并在Llama-3.1-8B和Qwen-3.5-27B上进行评估。我们的结果表明,路由是最佳方法,而RAG仅在更强的基础LLM上有所提升,软提示优化在非个性化基线上提升不显著。我们的广泛评估表明,在这种现实设置中,个性化仍然具有挑战性,但我们的适配方法显示出前景。

英文摘要

Typical LLM responses tend to follow a default style, even though users often have distinct preferences regarding tone, verbosity, and formality that they do not explicitly state in their prompts. Evaluating whether personalization methods can adapt to these implicit preferences is challenging, since users typically provide prompts rather than reference responses, style preferences are not factually verifiable, and reference-free LLM judges may conflate personalization with general response quality. To address these challenges, we introduce the Arbitrary Preference Mapping (APM) benchmark, which decouples user attributes (e.g. enthusiastic) from response principles (e.g. persuasive) via a hidden, randomized mapping $\mathbf{C}$ that maps user attributes to preferences about response traits. Because $\mathbf{C}$ carries no semantic content and is resampled across runs, models cannot exploit stereotypical associations and must infer preferences from conversation history. Using this unbiased evaluation methodology, we adapt retrieval-augmented, prompt-optimization, and routing personalization methods and evaluate them on Llama-3.1-8B and Qwen-3.5-27B. Our results show that routing is the most reliable approach, while RAG only improves with the stronger base LLM, and soft prompt optimization fails to improve significantly over a non-personalized baseline. Our extensive evaluation reveals that in this realistic setting, personalization remains challenging, but our adapted methods show promise.
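The role of the hidden mapping $\mathbf{C}$ can be sketched as follows: attributes are wired to response principles at random, so no semantic shortcut links the two, and a fresh mapping is drawn for each run. The signed weights in {-1, 0, +1}, the additive combination across attributes, and the function names are assumptions made for this illustration; the benchmark's actual construction is not specified in the abstract.

```python
import random
from typing import Dict, List

Mapping = Dict[str, Dict[str, int]]


def sample_mapping(attributes: List[str], principles: List[str], seed: int) -> Mapping:
    """Draw a random attribute-to-principle mapping; a new seed gives a new mapping."""
    rng = random.Random(seed)
    return {a: {p: rng.choice([-1, 0, 1]) for p in principles} for a in attributes}


def user_preferences(user_attributes: List[str], mapping: Mapping,
                     principles: List[str]) -> Dict[str, int]:
    """Combine the rows of the mapping for one user's attributes into preference scores."""
    return {p: sum(mapping[a][p] for a in user_attributes) for p in principles}


attributes = ["enthusiastic", "formal", "curious"]
principles = ["persuasive", "concise", "detailed"]
hidden_c = sample_mapping(attributes, principles, seed=7)
print(user_preferences(["enthusiastic", "curious"], hidden_c, principles))
```

Because the mapping is resampled across runs, a method that scores well cannot be relying on stereotyped pairings such as "enthusiastic users like persuasive text" and has to recover the preferences from the conversation history instead.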

2605.21061 2026-05-21 cs.CV cs.AI cs.RO

Grounding Driving VLA via Inverse Kinematics

Junsung Park, Hyunjung Shim

Affiliations * Korea Advanced Institute of Science and Technology; Kim Jaechul Graduate School of AI

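AI summary: This paper redesigns Driving VLA in the style of an inverse kinematics solver to address the neglect of visual tokens during trajectory prediction. By adding next visual state prediction and an inverse kinematics network, it improves visual grounding and trajectory planning performance.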

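Details
AI abstract (translated from Chinese)

Existing Driving VLAs largely ignore their visual tokens when predicting trajectories, a phenomenon we attribute to a structurally ill-posed task formulation rather than insufficient training. We show that trajectory recovery, viewed through the lens of inverse kinematics, requires both the current and the future visual state as boundary conditions; existing VLAs supply only the former, which pushes the model toward shortcut predictions based on ego status and text commands. To address this, we redesign Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective, which requires the LLM to predict the future visual scene, provides dense visual supervision and suppresses shortcut paths. Second, a separate inverse kinematics network (a cross-attention-based conditional diffusion model) takes only the current and future visual states as input, suppressing reliance on ego status and text shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and achieves trajectory planning performance comparable to 7B-8B-scale VLAs on the closed-loop NAVSIM-v2 and nuScenes benchmarks. Further analysis shows that the improvement stems from a recovered ability to exploit visual features, with the effect most pronounced in dynamic driving scenarios such as turning.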

English abstract

Existing Driving VLAs predict trajectories while largely ignoring their visual tokens -- a phenomenon we trace not to insufficient training but to a structurally ill-posed task formulation. We show that trajectory recovery, when viewed through the lens of inverse kinematics, requires both a current and a future visual state as boundary conditions; existing VLAs supply only the former, which encourages the model to shortcut through ego status and text commands alone. To address this, we re-design Driving VLA in the style of an inverse kinematics solver. First, a next visual state prediction objective that requires the LLM to predict the future visual scene provides dense visual supervision and suppresses shortcut paths. Second, a separate Inverse Kinematics Network (a cross-attention-based conditional diffusion model) that takes only the current and future visual states as input is designed to suppress reliance on ego status and textual shortcuts during trajectory decoding. With this simple prescription alone, our 0.5B-scale model recovers visual grounding and reaches trajectory planning performance comparable to 7B--8B VLAs more than an order of magnitude larger, on both the closed-loop NAVSIM-v2 and the nuScenes benchmarks. Extensive analysis further shows that this improvement stems from a recovered ability to exploit visual features, with the effect being most pronounced in dynamic driving situations such as turning.

2605.21060 2026-05-21 cs.LG cs.AI stat.ML

Divide et Calibra: Multiclass Local Calibration via Vector Quantization

Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana

Affiliations * University of Pisa; University of Trento; Meta; Fondazione Bruno Kessler

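AI summary: This paper proposes a compositional approach in which vector quantization induces a structured partition of the representation space and an indexed parameterization of Dirichlet concentrations shares parameters across regions. The method learns heterogeneous calibration maps that generalize to sparse regions, improving local calibration while maintaining global calibration and predictive performance.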

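Details
AI abstract (translated from Chinese)

Accurate and well-calibrated machine learning (ML) models are required in high-stakes settings, yet effective multiclass calibration remains challenging: global methods assume that calibration errors are homogeneous across the latent space, while local methods typically rely on latent-space dimensionality reduction, which loses information. To address these issues, we propose a compositional approach to multiclass calibration in which region-specific calibration maps are built from shared codeword-dependent factors. We realize this idea through vector quantization (VQ), which induces a structured partition of the representation space, together with an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize to sparse regions. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.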

English abstract

Accurate and well-calibrated Machine Learning (ML) models are mandatory in high-stakes settings, yet effective multiclass calibration remains challenging: global approaches assume calibration errors are homogeneous across the latent space, while local methods often rely on latent-space dimensionality reduction, which leads to information loss. To address these issues, we propose a compositional approach to multiclass calibration, where region-specific calibration maps are constructed from shared codeword-dependent factors. We instantiate this idea via Vector Quantization (VQ), which induces a structured partition of the representation space, and an indexed parameterization of Dirichlet concentrations that enables parameter sharing across regions. Our approach learns heterogeneous calibration maps that generalize well even to sparse regions of the latent space. Experiments on benchmark datasets show significant improvements in local calibration while maintaining competitive global calibration and predictive performance.
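
To make the partition step concrete, the sketch below shows the generic way a vector-quantization codebook splits a representation space into regions: each latent vector is assigned to its nearest codeword. This covers only standard nearest-codeword assignment under squared Euclidean distance. The function names, the toy codebook, and the toy latents are assumptions, and the paper's Dirichlet-concentration parameterization is not reproduced here.

```python
def nearest_codeword(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def partition(latents, codebook):
    """Group latent indices by their nearest codeword, one region per codeword."""
    regions = {k: [] for k in range(len(codebook))}
    for i, z in enumerate(latents):
        regions[nearest_codeword(z, codebook)].append(i)
    return regions


# Toy 2-D codebook and latents, chosen only for illustration.
codebook = [(0.0, 0.0), (5.0, 5.0)]
latents = [(0.1, -0.2), (4.8, 5.3), (0.4, 0.3)]

regions = partition(latents, codebook)  # {0: [0, 2], 1: [1]}
```

A local calibration method could then fit or evaluate a calibration map per region. Regions with few samples are where the parameter sharing described in the abstract would matter most.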