arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.08084 2026-05-11 cs.RO cs.CV

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Daniel Dauner, Valentin Charraut, Bastian Berle, Tianyu Li, Long Nguyen, Jiabao Wang, Changhui Jing, Maximilian Igl, Holger Caesar, Boris Ivanovic, Yiyi Liao, Andreas Geiger, Kashyap Chitta

Affiliations: KE:SAI; University of Tübingen; Tübingen AI Center; NVIDIA Research; Valeo Brain; OpenDriveLab at Shanghai Innovation Institute; Zhejiang University; Delft University of Technology

AI summary: 123D unifies multi-modal autonomous driving data behind a single API, addressing data fragmentation and synchronization, and provides analysis tools; it supports cross-dataset 3D object detection and reinforcement learning for planning.

Abstract

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.
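The idea of storing each modality as an independent timestamped event stream can be illustrated with a minimal sketch. This is not the py123d API: `EventStream`, `latest_at`, and `synchronized_frame` are hypothetical names, and the "most recent event at or before the query time" rule is an assumed synchronization policy chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class EventStream:
    """One modality stored as its own timestamped stream (no fixed rate)."""
    timestamps: list = field(default_factory=list)  # microseconds, ascending
    payloads: list = field(default_factory=list)

    def append(self, t_us, payload):
        # Events must arrive in time order so that binary search stays valid.
        assert not self.timestamps or t_us >= self.timestamps[-1]
        self.timestamps.append(t_us)
        self.payloads.append(payload)

    def latest_at(self, t_us):
        """Return the most recent (timestamp, payload) at or before t_us, or None."""
        i = bisect_right(self.timestamps, t_us)
        if i == 0:
            return None
        return self.timestamps[i - 1], self.payloads[i - 1]


def synchronized_frame(streams, t_us):
    """Synchronous access: sample every modality at one query time."""
    return {name: stream.latest_at(t_us) for name, stream in streams.items()}


camera, lidar = EventStream(), EventStream()
for t in (0, 33_000, 66_000):   # roughly 30 Hz camera (hypothetical data)
    camera.append(t, f"img@{t}")
for t in (0, 100_000):          # 10 Hz lidar (hypothetical data)
    lidar.append(t, f"sweep@{t}")

frame = synchronized_frame({"camera": camera, "lidar": lidar}, 70_000)
```

Asynchronous access in this sketch is simply iterating one stream's own timestamps; no modality has to be resampled to another modality's rate.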

2605.08075 2026-05-11 cs.LG eess.AS

Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

Maryam Maghsoudi, Shihab Shamma

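Affiliations: University of Maryland, College Park

AI summary: The paper uses listened-speech recordings to make imagined speech decoding more reliable; a three-stage pipeline reveals the relationship between neural activity and speech stimuli and shows that imagined speech can be decoded significantly above chance.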

Abstract

Decoding imagined speech from non-invasive brain recordings is challenging because imagined datasets are scarce and difficult to align temporally across subjects and sessions. In this work, we propose a new approach to the decoding of imagined speech that leverages the richer and more reliably labeled recordings made during listening to speech. We collected paired listened and imagined MEG recordings to rhythmic melodic and spoken stimuli from trained musicians. Using trained musicians helped improve temporal alignment across conditions. We then developed a three-stage decoding pipeline that revealed consistent and meaningful relationships between neural activity evoked by imagining and listening to the same stimuli. First, we trained six linear and neural models to map imagined MEG responses to listened responses. We evaluated these models against a null baseline from unseen subjects to validate that the predicted-listening responses preserve stimulus-specific information. In the second stage, we trained a contrastive word decoder exclusively on the listened MEG responses, and evaluated it using four embedding strategies including semantic, acoustic, and phonetic representations. In the third stage, we process the imagined MEG responses from held-out subjects through the mapping pipeline to compute the corresponding listening responses, which are then decoded by the listened decoder. Using rank-based analysis, we show that the imagined words are decodable significantly above chance. We report here the results of a proof-of-concept implementation to decode imagined speech, where all evaluations are performed on held-out subjects. We also demonstrate that performance improves with training data size, suggesting that this approach is scalable and can be applied directly to realistic brain-computer interface scenarios.

2605.08074 2026-05-11 cs.LG

GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

Peyman Baghershahi, Fangxin Wang, Debmalya Mandal, Sourav Medya

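Affiliations: Department of Computer Science, University of Illinois Chicago; Department of Computer Science, University of Warwick

AI summary: GRAPHLCP is a localized conformal prediction framework based on graph topology. Feature-aware densification and a Personalized PageRank-based kernel capture local and long-range dependencies and achieve high-coverage prediction with finite samples.

Comments: 20 pages, 9 figures, 8 tables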

Abstract

Conformal prediction (CP) provides a distribution-free approach to uncertainty quantification with finite-sample guarantees. However, applying CP to graph neural networks (GNNs) remains challenging as the combinatorial nature of graphs often leads to insufficiently certain predictions and indiscriminative embeddings. Existing methods primarily rely on embedding-space proximity for localization, which can be unreliable for graphs and yield inefficient prediction sets. We propose GRAPHLCP, a proximity-based localized CP framework that explicitly incorporates graph topology and inter-node dependencies into localization and weighting. Our approach introduces a feature-aware densification step to mitigate locality bias in sparse graphs, followed by a Personalized PageRank-based kernel computation to model structural proximity. This enables topology-dependent anchor sampling and calibration weighting that captures both local and long-range dependencies. Extensive experiments on several regression and classification datasets demonstrate that GRAPHLCP guarantees marginal coverage with finite samples while efficiently attaining favorable test conditional coverage across various conditioning scenarios.
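A rough sketch of how Personalized PageRank scores could serve as calibration weights is given below. It is not the GRAPHLCP algorithm: the densification step and anchor sampling are omitted, the weighted quantile is simplified (it leaves out the test point's own mass, which rigorous localized conformal prediction accounts for), and the toy graph, scores, and function names are all assumptions made for illustration.

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for PPR on a graph given as {node: [neighbors]}."""
    nodes = list(adj)
    p = {v: 1.0 if v == source else 0.0 for v in nodes}
    for _ in range(iters):
        # Restart mass always returns to the source node.
        nxt = {v: (alpha if v == source else 0.0) for v in nodes}
        for u in nodes:
            if adj[u]:
                share = (1 - alpha) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share
            else:
                nxt[source] += (1 - alpha) * p[u]  # dangling node: restart
        p = nxt
    return p


def weighted_quantile(scores, weights, level):
    """Smallest score whose normalized cumulative weight reaches `level`."""
    total = sum(weights)
    acc = 0.0
    for score, weight in sorted(zip(scores, weights)):
        acc += weight / total
        if acc >= level:
            return score
    return float("inf")


# Toy path graph a - b - c - d; "a" is the test node, the rest are calibration nodes.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ppr = personalized_pagerank(adj, "a")
calib_scores = {"b": 0.2, "c": 0.5, "d": 0.9}  # hypothetical nonconformity scores
threshold = weighted_quantile(
    [calib_scores[v] for v in calib_scores],
    [ppr[v] for v in calib_scores],
    level=0.8,
)
```

Calibration nodes that are structurally closer to the test node receive more weight, so the threshold reflects the test node's graph neighborhood rather than the whole calibration set.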

2605.08073 2026-05-11 cs.CV cs.AI

EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

Wei Yu, Yunhang Qian

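Affiliations: Harbin Institute of Technology

AI summary: The paper proposes EmambaIR, an efficient visual state space model whose cross-modal Top-k Sparse Attention Module and Gated State-Space Module improve the efficiency and quality of event-guided image reconstruction and capture global context more efficiently.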

Abstract

Recent event-based image reconstruction methods predominantly rely on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to process complementary event information. However, these architectures face fundamental limitations: CNNs often fail to capture global feature correlations, whereas ViTs incur quadratic computational complexity (e.g., $O(n^2)$), hindering their application in high-resolution scenarios. To address these bottlenecks, we introduce EmambaIR, an Efficient visual State Space Model designed for image reconstruction using spatially sparse and temporally continuous event streams. Our framework introduces two key components: the cross-modal Top-k Sparse Attention Module (TSAM) and the Gated State-Space Module (GSSM). TSAM efficiently performs pixel-level top-k sparse attention to guide cross-modal interactions, yielding rich yet sparse fusion features. Subsequently, GSSM utilizes a nonlinear gated unit to enhance the temporal representation of vanilla linear-complexity ($O(n)$) SSMs, effectively capturing global contextual dependencies without the typical computational overhead. Extensive experiments on six datasets across three diverse image reconstruction tasks - motion deblurring, deraining, and High Dynamic Range (HDR) enhancement - demonstrate that EmambaIR significantly outperforms state-of-the-art methods while offering substantial reductions in memory consumption and computational cost. The source code and data are publicly available at: https://github.com/YunhangWickert/EmambaIR
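Top-k sparse attention in its generic form can be sketched as follows: a query attends only to its k highest-scoring keys, and the softmax is taken over that subset. This is a plain single-query illustration and makes no claim about how TSAM is implemented; the function name and the toy vectors are assumptions.

```python
import math


def topk_sparse_attention(query, keys, values, k):
    """Attend from one query to only its k highest-scoring keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    # Indices of the k largest scores; every other key is ignored entirely.
    keep = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    top = max(scores[i] for i in keep)
    exp = {i: math.exp(scores[i] - top) for i in keep}  # stable softmax
    z = sum(exp.values())
    dim = len(values[0])
    return [sum(exp[i] / z * values[i][d] for i in keep) for d in range(dim)]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
sparse = topk_sparse_attention(query, keys, values, k=1)
dense = topk_sparse_attention(query, keys, values, k=3)
```

With `k=1` only the best-matching key contributes; with `k` equal to the number of keys the function reduces to ordinary softmax attention.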

2605.08070 2026-05-11 cs.AI

VecCISC: Improving Confidence-Informed Self-Consistency with Reasoning Trace Clustering and Candidate Answer Selection

James Petullo, Sonny George, Dylan Cashman, Nianwen Xue

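Affiliations: Computer Science Department, Brandeis University

AI summary: VecCISC improves confidence-informed self-consistency by clustering reasoning traces and filtering candidate answers, reducing compute overhead while maintaining high accuracy.

Comments: Accepted to Findings of ACL 2026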

Abstract

A standard technique for scaling inference-time reasoning is Self-Consistency, whereby multiple candidate answers are sampled from an LLM and the most common answer is selected. More recently, it has been shown that weighted majority voting (e.g. Confidence-Informed Self Consistency (CISC)), which assigns a confidence value to each candidate answer and chooses the answer with the largest accumulated score, tends to be more accurate on a wide range of popular benchmarks. In practice, weighted majority voting necessitates calling a critic LLM on each candidate's reasoning trace to produce the answer's confidence score. This secondary series of LLM calls greatly increases the overhead and cost of weighted majority voting, despite its potential performance benefits. To reduce this expense, we propose VecCISC, a lightweight, adaptive framework that uses a measure of semantic similarity to filter reasoning traces that are semantically equivalent to others, degenerate, or hallucinated, thus decreasing the number of candidate answers that must be evaluated by the critic. To ensure adequate experimental thoroughness, we evaluate VecCISC on five challenging, widely-adopted datasets spanning the domains of mathematics, chemistry, biology, commonsense reasoning, and the humanities. Our results demonstrate that VecCISC reduces the total token usage by 47%, while maintaining or exceeding the accuracy of CISC.
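The difference between plain Self-Consistency and confidence-weighted voting, and the effect of dropping near-duplicate traces before calling a critic, can be shown with a small sketch. The embeddings, the 0.95 cosine threshold, and the greedy filter are illustrative assumptions; the abstract does not specify VecCISC's actual similarity measure or selection rule.

```python
import math
from collections import defaultdict


def majority_vote(answers):
    """Self-Consistency: pick the most frequent answer."""
    counts = defaultdict(int)
    for answer in answers:
        counts[answer] += 1
    return max(counts, key=counts.get)


def weighted_vote(answers, confidences):
    """Confidence-weighted voting: pick the answer with the largest summed confidence."""
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def filter_near_duplicates(embeddings, threshold=0.95):
    """Greedily keep a trace only if it is not too similar to one already kept."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


answers = ["A", "A", "B"]
confidences = [0.2, 0.2, 0.9]                          # hypothetical critic scores
trace_embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]  # hypothetical trace vectors

plain = majority_vote(answers)
weighted = weighted_vote(answers, confidences)
kept = filter_near_duplicates(trace_embeddings)  # only these traces go to the critic
```

In this toy case the second trace is nearly identical to the first, so only two of the three traces would need a critic call.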

2605.08064 2026-05-11 cs.CV

Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment

Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, Kurt Keutzer, Wenzhao Zheng

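Affiliations: Tsinghua University; Panasonic AI Lab; Panasonic DX-CPS; UC Berkeley

AI summary: The paper proposes Proxy3D, which extracts scene features with semantic and geometric encoders and applies semantic-aware clustering to produce compact yet comprehensive 3D proxy representations, improving vision-language models on 3D visual question answering, visual grounding, and spatial intelligence tasks.

Comments: Accepted by CVPR 2026. Project page: https://wzzheng.net/Proxy3D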

Abstract

Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding, and general spatial intelligence benchmarks.

2605.08061 2026-05-11 cs.AI

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Manish Bhattarai, Ismael Boureima, Nishath Rajiv Ranasinghe, Scott Pakin, Dan O'Malley

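Affiliations: Los Alamos National Laboratory

AI summary: The paper proposes a rubric-grounded reinforcement learning framework in which an LLM judge produces multi-criterion rewards, improving model performance on general reasoning tasks.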

Abstract

We argue that decomposing reward into weighted, verifiable criteria and using an LLM judge to score them provides a partial-credit optimization signal: instead of a binary outcome or a single holistic score, each response is graded along multiple task-specific criteria. We formalize rubric-grounded reinforcement learning (RL): a framework in which the policy is optimized against a structured, multi-criterion reward produced by a frozen LLM judge that conditions on auxiliary grounding the policy never sees. We instantiate the framework by deriving rubrics from an Office of Scientific and Technical Information (OSTI)-derived corpus of roughly 100,000 scientific and technical documents and training Llama-3.1-8B-Instruct with Group Relative Policy Optimization (GRPO). With GRPO-based training, the model achieves 71.7% normalized reward on held-out rubric evaluation. The GRPO-tuned policy also improves over the base model on four reasoning benchmarks not derived from the training corpus: GSM8K, MATH, GPQA Main, and GPQA Diamond. These results provide evidence that structured, document-grounded rewards can improve held-out rubric performance and induce transferable reasoning behaviors beyond the corpus used to construct the training environment.
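The two ingredients named above, a weighted multi-criterion reward and group-relative advantages, can be sketched as follows. The criterion names, weights, and judge scores are hypothetical, and normalizing by the population standard deviation is an assumed detail rather than something the abstract states.

```python
from statistics import mean, pstdev


def rubric_reward(scores, weights):
    """Weighted average of per-criterion judge scores, each assumed to lie in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric for one prompt, and judge scores for three sampled responses.
WEIGHTS = {"correctness": 2.0, "units": 1.0, "citations": 1.0}
group_scores = [
    {"correctness": 1.0, "units": 0.5, "citations": 0.0},
    {"correctness": 0.0, "units": 1.0, "citations": 1.0},
    {"correctness": 1.0, "units": 1.0, "citations": 1.0},
]

rewards = [rubric_reward(s, WEIGHTS) for s in group_scores]
advantages = group_relative_advantages(rewards)
```

A response that satisfies only some criteria still earns a nonzero reward, which is the partial-credit signal the abstract contrasts with a binary outcome.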

2605.08060 2026-05-11 cs.CL cs.AI cs.GT cs.MA

The Memory Curse: How Expanded Recall Erodes Cooperative Intent in LLM Agents

Jiayuan Liu, Tianqin Li, Shiyi Du, Xin Luo, Haoxuan Zeng, Emanuel Tewolde, Tai Sing Lee, Tonghan Wang, Carl Kingsford, Vincent Conitzer

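Affiliations: Carnegie Mellon University; Foundations of Cooperative AI Lab (FOCAL); University of Michigan; Harvard University

AI summary: The study finds that expanded recall systematically weakens cooperative intent in multi-agent social dilemmas; three analyses show how memory content affects cooperation, establishing memory as an active factor in multi-agent behavior.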

Abstract

Context window expansion is often treated as a straightforward capability upgrade for LLMs, but we find it systematically fails in multi-agent social dilemmas. Across 7 LLMs and 4 games over 500 rounds, expanding accessible history degrades cooperation in 18 of 28 model-game settings, a pattern we term the memory curse. We isolate the underlying mechanism through three analyses. First, lexical analysis of 378,000 reasoning traces associates this breakdown with eroding forward-looking intent rather than rising paranoia. We validate this using targeted fine-tuning as a cognitive probe: a LoRA adapter trained exclusively on forward-looking traces mitigates the decay and transfers zero-shot to distinct games. Second, memory sanitization holds prompt length fixed while replacing visible history with synthetic cooperative records, which restores cooperation substantially, proving the trigger is memory content, not length alone. Finally, ablating explicit Chain-of-Thought reasoning often reduces the collapse, showing that deliberation paradoxically amplifies the memory curse. Together, these results recast memory as an active determinant of multi-agent behavior: longer recall can either destabilize or support cooperation depending on the reasoning patterns it elicits.

2605.08059 2026-05-11 cs.CV cs.RO

6D Pose Estimation via Keypoint Heatmap Regression with RGB-D Residual Neural Networks

Ismail Aljosevic, Amir Masoud Almasi, Ana Parovic, Ashkan Shafiei

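Affiliations: Politecnico di Torino

AI summary: The paper proposes a modular framework based on keypoint heatmap regression that combines YOLOv10m for object detection with a ResNet18 that predicts 2D heatmaps from RGB images, estimates 6D pose with the PnP RANSAC algorithm, and improves performance with an RGB-D cross-fusion architecture, reaching 92.41% accuracy on the LINEMOD dataset.

Comments: Source code available at: https://github.com/ameermasood/HeatNet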

Abstract

In this paper, we propose a modular framework for 6D pose estimation based on keypoint heatmap regression. Our approach combines YOLOv10m for object detection with a ResNet18-based network that predicts 2D heatmaps from RGB images. Keypoints extracted from these heatmaps are used to estimate the 6D object pose via the PnP RANSAC algorithm. We compare different keypoint selection strategies to assess their impact on pose accuracy. Additionally, we extend the baseline by incorporating depth data using a cross-fusion architecture, which enables interaction between RGB and depth features at multiple stages. We further explore general training improvements, such as experimenting with activation functions and learning rate scheduling strategies to improve model performance. Our best RGB-only model achieved a mean ADD-based accuracy of 84.50%, while the RGB-D fusion model reached 92.41% on the LINEMOD dataset. The code is available at https://github.com/ameermasood/HeatNet.
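The step between heatmap prediction and pose estimation, turning each heatmap into a 2D keypoint in image coordinates, can be sketched as below. This is not code from the HeatNet repository: taking the arg-max peak, the pixel-center offset of 0.5, and the crop-box mapping are assumptions made for illustration.

```python
def heatmap_to_keypoint(heatmap):
    """Return (x, y, confidence) of the peak in a 2D heatmap given as a list of rows."""
    best = (0, 0, heatmap[0][0])
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best


def to_image_coords(keypoint, box, heatmap_size):
    """Map a heatmap peak back into the full image using the detector's crop box."""
    x, y, conf = keypoint
    x0, y0, x1, y1 = box          # crop box in image pixels
    width, height = heatmap_size  # heatmap resolution
    # Use the center of the heatmap cell, then rescale to the crop size.
    u = x0 + (x + 0.5) * (x1 - x0) / width
    v = y0 + (y + 0.5) * (y1 - y0) / height
    return u, v, conf


heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.1],
    [0.0, 0.0, 0.3, 0.0],
]
peak = heatmap_to_keypoint(heatmap)
image_point = to_image_coords(peak, box=(100, 50, 140, 80), heatmap_size=(4, 3))
```

The resulting 2D points, paired with the corresponding 3D model keypoints and the camera intrinsics, are what a PnP RANSAC solver (for example OpenCV's `cv2.solvePnPRansac`) would consume; that step is not shown here.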

2605.08057 2026-05-11 cs.CL cs.AI

CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation

James Petullo, Nianwen Xue

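Affiliations: Brandeis University

AI summary: CA-SQL dynamically scales exploration breadth and uses a custom prompt seeding method to improve Text-to-SQL performance on the Bird-Bench benchmark, scoring 51.72% on the challenging tier.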

Abstract

While recent advancements in inference-time learning have improved LLM reasoning on Text-to-SQL tasks, current solutions still struggle to perform well on the most challenging tasks in the Bird-Bench (BIRD) benchmark. This is due to inadequate solution space exploration, which is necessary to uncover promising candidate queries that can be further refined to produce the correct output. To address this challenge, we introduce CA-SQL, a novel Text-to-SQL pipeline that utilizes the estimated difficulty of a task to dynamically scale the breadth of the exploration for generating solution candidates. In addition, we use a custom prompt seeding method, based on principles of evolutionary search, to further elicit exploratory behavior from the base LLM and a novel voting method to select the best candidate solution at the end of the search. Experiments demonstrate that our solution achieves a state-of-the-art score of 51.72% on the "challenging" tier of BIRD development set problems, using only GPT-4o-mini, outperforming other in-context learning approaches, even those that leverage larger models. Overall, our method attains a competitive 61.06% execution accuracy and 68.77% Soft F1 score on the BIRD development dataset.

2605.08054 2026-05-11 cs.CV

Towards Highly-Constrained Human Motion Generation with Retrieval-Guided Diffusion Noise Optimization

Hanchao Liu, Fang-Lue Zhang, Shining Zhang, Tai-Jiang Mu, Shi-Min Hu

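Affiliations: Tsinghua University; University of New South Wales

AI summary: The paper proposes a retrieval-guided method built on a training-free diffusion noise optimization framework; relational task parsing and reward-guided mask optimization address motion generation under tight constraints and improve behavior synthesis for virtual agents.

Comments: Accepted to CVPR 2026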

Abstract

Generating human motion that satisfies customized zero-shot goal functions, enabling applications such as controllable character animation and behavior synthesis for virtual agents, is a critical capability. While current approaches handle many unseen constraints, they fail on tasks with very challenging spatiotemporal restrictions, such as severe spatial obstacles or specified numbers of walking steps. To equip motion generators for these highly constrained tasks, we present a retrieval-guided method built on the training-free diffusion noise optimization framework. The key idea is to search within large motion datasets for guidance that can potentially satisfy difficult constraints. We introduce relational task parsing to group target constraints and identify the difficult ones to be handled by a retrieved reference. A better initialization for diffusion noise is then obtained via a reward-guided mask that combines random noise with retrieved noise. By optimizing diffusion noise from this improved initialization, we successfully solve highly constrained generation tasks. By leveraging an LLM for relational task parsing, the whole framework can further reason automatically about what to retrieve, improving the intelligence of moving agents under a training-free optimization scheme.

2605.08053 2026-05-11 cs.LG

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Gugan Thoppe, L. A. Prashanth, Ankur Naskar, Sanjay Bhat

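Affiliations: Computer Science and Automation, Indian Institute of Science, Bengaluru; Computer Science and Engineering, Indian Institute of Technology Madras, Chennai; Tata Consultancy Services Limited, Hyderabad

AI summary: The paper studies reinforcement learning for exponential-utility optimization in discounted Markov decision processes, proposes two model-free algorithms, and proves their convergence, providing a theoretical foundation for value-based reinforcement learning under exponential-utility objectives.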

Abstract

Reinforcement learning (RL) for exponential-utility optimization in discounted Markov decision processes (MDPs) lacks principled value-based algorithms. We address this gap in the fixed risk-aversion setting. Building on the Bellman-type equation for exponential utility studied by Porteus (1975), we derive two Q-value-style extensions and show that the associated operators are contractions in the $L_\infty$ and sup-log/Thompson metrics, respectively. We characterize their fixed points and prove that the induced greedy stationary policy is optimal for the exponential-utility objective among stationary policies. These structural results lead to two model-free algorithms: a two-timescale Q-learning-style algorithm, for which we establish almost-sure convergence and provide finite-time convergence rates via timescale separation, and a one-timescale algorithm governed by a sublinear power-law operator. Since the latter does not admit a global contraction in standard metrics, we prove its convergence using delicate arguments based on local Lipschitzness, monotonicity, homogeneity, and Dini derivatives, and provide a scalar finite-time analysis that highlights the challenges in obtaining convergence rates in the vector case. Our work provides a foundation for value-based RL under exponential-utility objectives.

2605.08050 2026-05-11 cs.CV

MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

Xinyan Ye, Jiankang Deng, Abbas Edalat

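Affiliations: Imperial College London

AI summary: MoCoTalk unifies four control signals (identity, head pose, facial expression, and mouth dynamics) and proposes an adaptive multi-condition routing framework that resolves interference among heterogeneous conditions, improving the quality and controllability of talking-head generation.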

Abstract

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples head motion, mouth motion, expression, and lighting. This design provides a temporally consistent geometric prior and allows flexible recombination of these attributes at inference. We further introduce a lip consistency loss to tighten audio-visual alignment. Extensive experiments show that MoCoTalk achieves state-of-the-art performance on the majority of structural, motion, and perceptual metrics, while offering attribute-level controllability that single-condition methods do not provide.

2605.08048 2026-05-11 cs.CL

Accurate and Efficient Statistical Testing for Word Semantic Breadth

Yo Ehara

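Affiliations: Tokyo Gakugei University

AI summary: The paper proposes a Householder-aligned permutation test that separates differences in word semantic breadth from directional differences, reducing Type-I error and improving efficiency.

Comments: Accepted to ACL 2026 Main Conference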

Abstract

Measuring the breadth of a word's meaning, or its spread across contexts, has become feasible with contextualized token embeddings. A word type can be represented as a cloud of token vectors, with dispersion-based statistics serving as proxies for contextual diversity (Nagata and Tanaka-Ishii, ACL2025). These measurements are useful for deciding appropriate sense distinctions when constructing thesauri and domain-specific dictionaries. However, when comparing the breadth of two word types, naive hypothesis testing on dispersion can be misleading: differences in semantic direction can masquerade as dispersion differences, inflating Type-I error and yielding "statistically significant" outcomes even when there is no true breadth difference. This is problematic because significance testing should distinguish genuine effects from incidental fluctuations in small-difference regimes. We propose a Householder-aligned permutation test to isolate dispersion differences from directional differences. Our method applies a single Householder reflection to align the mean directions of the two word types and then performs a permutation test on the aligned token clouds, yielding calibrated, non-parametric p-values. For practicality, we introduce a GPU-oriented implementation that batches permutations and linear algebra operations. Empirically, our alignment reduced Type-I error by 32.5% while preserving sensitivity to genuine breadth differences, and achieved a 23x speedup over the CPU baseline.
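The align-then-permute procedure described above can be sketched in a few lines. This is a CPU-only, pure-Python illustration, not the authors' GPU implementation; the dispersion statistic (mean squared distance to the centroid), the number of permutations, and the toy clouds are assumptions.

```python
import math
import random


def mean_vec(X):
    return [sum(col) / len(X) for col in zip(*X)]


def unit(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def householder_align(X, u, v):
    """Reflect every row of X with the Householder map that sends unit vector u to unit vector v."""
    w = [a - b for a, b in zip(u, v)]
    ww = sum(a * a for a in w)
    if ww < 1e-12:  # directions already coincide
        return [list(x) for x in X]
    out = []
    for x in X:
        c = 2.0 * sum(a * b for a, b in zip(w, x)) / ww
        out.append([a - c * b for a, b in zip(x, w)])
    return out


def dispersion(X):
    """Mean squared distance of the token vectors to their centroid."""
    mu = mean_vec(X)
    return sum(sum((a - m) ** 2 for a, m in zip(x, mu)) for x in X) / len(X)


def aligned_permutation_test(A, B, n_perm=2000, seed=0):
    rng = random.Random(seed)
    A = householder_align(A, unit(mean_vec(A)), unit(mean_vec(B)))
    observed = abs(dispersion(A) - dispersion(B))
    pooled = A + B
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(dispersion(pooled[:len(A)]) - dispersion(pooled[len(A):]))
        hits += stat >= observed
    return (hits + 1) / (n_perm + 1)


# Two toy clouds with identical spread but orthogonal mean directions.
A = [[1.0, 0.1], [1.0, -0.1], [1.2, 0.0], [0.8, 0.0]]
B = [[0.1, 1.0], [-0.1, 1.0], [0.0, 1.2], [0.0, 0.8]]
p_value = aligned_permutation_test(A, B)
reflected = householder_align([[1.0, 0.0]], [1.0, 0.0], [0.0, 1.0])
```

The reflection leaves each cloud's own dispersion unchanged; what it changes is the null distribution, because shuffling tokens between clouds with different mean directions would otherwise inflate the dispersion of every permuted group.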

2605.08045 2026-05-11 cs.CL

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

Yi Yu, Parker Martin, Zhenyu Bu, Yixuan Liu, Yi-Yu Zheng, Orlando Simonetti, Yuchi Han, Yuan Xue

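Affiliations: The Ohio State University

AI summary: The paper proposes CMR-EXTR, a framework that uses distilled LLMs to convert free-text CMR reports into structured data with per-field confidence, enabling fully offline inference and integrated uncertainty estimation and reaching 99.65% variable-level accuracy.

Comments: Accepted to ISBI 2026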

Abstract

Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.

2605.08044 2026-05-11 cs.CL cs.AI cs.LG

Fast Byte Latent Transformer

快速字节潜在变换器

Julie Kallini, Artidoro Pagnoni, Tomasz Limisiewicz, Gargi Ghosh, Luke Zettlemoyer, Christopher Potts, Xiaochuang Han, Srinivasan Iyer

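Affiliations: FAIR at Meta; Stanford University; University of Washington

AI summary: This paper proposes BLT Diffusion, BLT Self-speculation, and related methods that use parallel generation and verification steps to speed up byte-level language model generation while preserving quality and reducing memory-bandwidth cost.

Details
AI Chinese abstract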

近期的字节级语言模型(LMs)无需依赖子词词表即可达到与词元级模型相当的性能,但其实用性受限于缓慢的逐字节自回归生成。我们通过新的训练和生成技术解决了Byte Latent Transformer(BLT)中的这一瓶颈。首先,我们引入BLT Diffusion(BLT-D),这是一种新模型,也是我们最快的BLT变体,它在标准的下一字节预测损失之外,还使用辅助的块级扩散目标进行训练。这使推理过程能够在每个解码步骤中并行生成多个字节,显著减少生成序列所需的前向传递次数。其次,我们提出两种受推测解码启发的扩展,以牺牲部分速度换取更高的生成质量:BLT Self-speculation(BLT-S),其中BLT的局部解码器越过其正常的补丁边界继续生成草稿字节,再通过单次完整模型前向传递进行验证;以及BLT Diffusion+Verification(BLT-DV),在基于扩散的生成之后为BLT-D增加一个自回归验证步骤。在生成任务上,所有方法的估计内存带宽成本都可能比BLT低50%以上。每种方法各有其独特优势,共同消除了字节级LMs实际应用中的关键障碍。

English abstract

Recent byte-level language models (LMs) match the performance of token-level models without relying on subword vocabularies, yet their utility is limited by slow, byte-by-byte autoregressive generation. We address this bottleneck in the Byte Latent Transformer (BLT) through new training and generation techniques. First, we introduce BLT Diffusion (BLT-D), a new model and our fastest BLT variant, trained with an auxiliary block-wise diffusion objective alongside the standard next-byte prediction loss. This enables an inference procedure that generates multiple bytes in parallel per decoding step, substantially reducing the number of forward passes required to generate a sequence. Second, we propose two extensions inspired by speculative decoding that trade some of this speed for higher generation quality: BLT Self-speculation (BLT-S), in which BLT's local decoder continues generating past its normal patch boundaries to draft bytes, which are then verified with a single full-model forward pass; and BLT Diffusion+Verification (BLT-DV), which augments BLT-D with an autoregressive verification step after diffusion-based generation. All methods may achieve an estimated memory-bandwidth cost over 50% lower than BLT on generation tasks. Each approach offers its own unique advantages, together removing key barriers to the practical use of byte-level LMs.
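The draft-then-verify pattern behind the two speculative extensions can be sketched with toy callables. This is a simplified greedy variant under stated assumptions, not BLT's actual procedure: `draft` and `verify` are hypothetical functions that map a byte prefix to the next byte, and the toy models `target` and `sloppy` exist only for illustration.

```python
from typing import Callable, List

NextByte = Callable[[bytes], int]  # hypothetical interface: prefix -> next byte

def speculative_step(prefix: bytes, draft: NextByte, verify: NextByte, k: int) -> bytes:
    """Draft k bytes cheaply, keep the longest prefix the verifier agrees with."""
    drafted: List[int] = []
    ctx = prefix
    for _ in range(k):
        b = draft(ctx)
        drafted.append(b)
        ctx += bytes([b])
    # In a real model the verifier's predictions for all drafted positions
    # would come from one batched forward pass; here they are computed in a loop.
    accepted = bytearray()
    for b in drafted:
        wanted = verify(prefix + bytes(accepted))
        if wanted != b:
            accepted.append(wanted)  # replace the first mismatch and stop
            return bytes(accepted)
        accepted.append(b)
    accepted.append(verify(prefix + bytes(accepted)))  # extra byte when all drafts pass
    return bytes(accepted)

def generate(prompt: bytes, draft: NextByte, verify: NextByte, k: int, n_bytes: int) -> bytes:
    """Repeat speculative steps until n_bytes new bytes exist, then trim."""
    out = prompt
    while len(out) - len(prompt) < n_bytes:
        out += speculative_step(out, draft, verify, k)
    return out[:len(prompt) + n_bytes]

# Toy models: the verifier cycles through TEXT, the drafter errs at every 4th position.
TEXT = b"hello world "

def target(ctx: bytes) -> int:
    return TEXT[len(ctx) % len(TEXT)]

def sloppy(ctx: bytes) -> int:
    return target(ctx) if len(ctx) % 4 else ord("?")
```

With this acceptance rule the output always equals what greedy decoding with the verifier alone would produce. A better drafter only reduces the number of verification rounds.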

2605.08043 2026-05-11 cs.CV cs.AI

SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

SCOPE:结构分解与条件技能协调用于复杂图像生成

Tianfei Ren, Zhipeng Yan, Yiming Zhao, Zhen Fang, Yu Zeng, Guohui Zhang, Hang Xu, Xiaoxiao Ma, Shiting Huang, Ke Xu, Wenxuan Huang, Lionel Z. Wang, Lin Chen, Zehui Chen, Jie Huang, Feng Zhao

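Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; The Hong Kong Polytechnic University; Nanyang Technological University

AI summary: This paper proposes SCOPE, a framework that uses a structured specification and conditional skill orchestration to keep track of semantic commitments throughout complex image generation. It is evaluated on the new Gen-Arena benchmark, where it outperforms the evaluated baselines.

Details
AI Chinese abstract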

尽管文本到图像模型在视觉保真度上取得了显著进展,但忠实实现复杂的视觉意图仍具挑战性,因为许多要求必须跨接地、生成和验证过程跟踪。我们称这些要求为语义承诺,并正式化其生命周期断续性为概念裂痕,其中承诺可能局部解决或检查,但无法在整个生成生命周期中保持可识别的同一操作单元。为了解决这个问题,我们提出了SCOPE,一个由规范引导的技能协调框架,通过演进的结构化规范维护语义承诺,并在未解决或违反承诺时有条件地调用检索、推理和修复技能。为了评估承诺层面的意图实现,我们引入了Gen-Arena,一个带有实体和约束级别规范的人工标注基准,以及实体门控意图通过率(EGIP),一个严格的先实体通过标准。SCOPE在Gen-Arena上显著优于所有评估基线,达到0.60 EGIP,并在WISE-V(0.907)和MindBench(0.61)上进一步取得强劲结果,证明了持续跟踪承诺对于复杂图像生成的有效性。

English abstract

While text-to-image models have made strong progress in visual fidelity, faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift, where commitments may be locally resolved or checked but fail to remain identifiable as the same operational units throughout the generation lifecycle. To address this, we propose SCOPE, a specification-guided skill orchestration framework that maintains semantic commitments in an evolving structured specification and conditionally invokes retrieval, reasoning, and repair skills around unresolved or violated commitments. To evaluate commitment-level intent realization, we introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms all evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and further achieves strong results on WISE-V (0.907) and MindBench (0.61), demonstrating the effectiveness of persistent commitment tracking for complex image generation.

2605.08037 2026-05-11 cs.LG cs.AI

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

超越成对:你的语言模型其实是在优化一个偏好图

Ning Liu, Chuanneng Sun, Kristina Klinkner, Shervin Malmasi

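Affiliations: Amazon

AI summary: This paper proposes GraphDPO, which optimizes language models over preference graphs built from multiple rollouts per prompt, addressing the limits of pairwise optimization on such data and improving performance on reasoning and program synthesis tasks.

Details
AI Chinese abstract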

直接偏好优化(DPO)通过成对偏好比较对语言模型进行对齐,提供了一种简单有效的替代强化学习(RL)从人类反馈的方法。然而,在许多实际场景中,训练数据包含每个提示的多个轮次,导致丰富的偏好结构,而传统的成对DPO无法利用这些信息。本文提出图直接偏好优化(GraphDPO),一种基于由轮次排名诱导的有向无环偏好图的DPO扩展方法。GraphDPO将支配关系编码为边,并优化一种基于Plackett-Luce启发式的图结构目标函数,通过图邻域聚合监督,强制传递性并恢复标准DPO为特殊情况。为处理离散或稀疏信号,引入等价类构造,相同偏好的响应形成图层,层内边贡献零损失,防止虚假梯度。尽管利用完整图结构,GraphDPO通过高效的log-sum-exp聚合保持线性每提示复杂度。进一步引入可选的真实解锚定,通过插入已验证的解决方案作为主导节点,并应用退火调度,稳定早期训练同时逐渐放松 oracle 监督。在推理和编程合成任务中的实验显示,GraphDPO表现优于传统方法,表明图结构偏好建模是可扩展且稳健的替代方案。

English abstract

Direct Preference Optimization (DPO) aligns language models using pairwise preference comparisons, offering a simple and effective alternative to Reinforcement Learning (RL) from human feedback. However, in many practical settings, training data consists of multiple rollouts per prompt, inducing rich preference structure that pairwise DPO fails to exploit. Collapsing such data into independent pairs discards transitivity, introduces redundant or conflicting supervision, and can lead to unstable optimization. We propose Graph Direct Preference Optimization (GraphDPO), a principled generalization of DPO that operates over directed acyclic preference graphs induced by rollout rankings. GraphDPO encodes dominance relations as edges and optimizes a graph-structured Plackett--Luce-inspired objective that aggregates supervision over graph neighborhoods, enforcing transitivity while recovering standard DPO as a special case. To handle discrete or sparse signals, we introduce an equivalence-class construction where responses with identical preferences form graph layers, and intra-layer edges contribute zero loss, preventing spurious gradients. Despite leveraging full graph structure, GraphDPO maintains linear per-prompt complexity via efficient log-sum-exp aggregation. We further incorporate optional ground-truth anchoring by inserting verified solutions as dominant nodes and applying an annealed schedule that stabilizes early training while gradually relaxing oracle supervision. Experiments on reasoning and program synthesis tasks demonstrate superior performance, suggesting that graph-structured preference modeling is a scalable and robust alternative to pairwise and listwise alignment objectives.
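One way to picture a layered, log-sum-exp preference loss is sketched below. The abstract does not give GraphDPO's exact objective, so this is an assumed form chosen only because it has the stated properties: tied responses in the same layer contribute no loss, the cost is linear in the number of responses, and two responses reduce to the standard DPO term. Each score stands for an implicit reward such as $\beta(\log \pi_\theta(y|x) - \log \pi_{\mathrm{ref}}(y|x))$, and the function names are hypothetical.

```python
import math
from typing import List

def logaddexp(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b)), with -inf as the empty sum."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def layered_preference_loss(layers: List[List[float]]) -> float:
    """layers[0] is the most preferred tier; entries are implicit rewards.

    Each response competes against every response in strictly lower tiers:
        loss_i = logsumexp({s_i} U lower tiers) - s_i
    Responses in the same tier never appear in each other's denominator.
    """
    loss = 0.0
    lower = -math.inf  # running log-sum-exp over strictly lower tiers
    for tier in reversed(layers):
        if lower != -math.inf:
            for s in tier:
                loss += logaddexp(s, lower) - s
        for s in tier:  # fold this tier in only after scoring it
            lower = logaddexp(lower, s)
    return loss
```

For a single winner and a single loser, the term equals $-\log \sigma(s_w - s_l)$, which is the pairwise DPO loss. The running accumulator is what keeps the cost linear rather than quadratic in the number of rollouts.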

2605.08036 2026-05-11 cs.LG

Don't Get Your Kroneckers in a Twist: Gaussian Processes on High-Dimensional Incomplete Grids

别把克罗内克积搅乱:高维不完整网格上的高斯过程

Mads Greisen Højlund, August Smart Lykke-Møller, Henry Moss, Ove Christiansen

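Affiliations: Department of Chemistry, Aarhus University; School of Mathematical Sciences, Lancaster University

AI summary: CUTS-GPR performs high-dimensional Gaussian process regression using a fast kernel matrix-vector product with linear or near-linear scaling in the amount of training data, enabling Bayesian modeling of high-dimensional potential energy surfaces.

Comments: 51 pages, 8 figures

Details
AI Chinese abstract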

我们介绍了CUTS-GPR,一种在高维设置中进行数值精确高斯过程回归(GPR)的新方法。CUTS-GPR的关键组件是一个极快的核矩阵-向量乘法,其规模与训练数据量$N$呈近线性或线性关系,与维度$D$呈低阶多项式关系。这通过结合加法核与不完整网格并利用由此产生的核矩阵结构来实现。我们通过运行数十亿数据点和数千维的基准测试来展示矩阵-向量乘法的可扩展性。对于$N = 447265$和$D = 24$,完整的GPR计算,包括超参数优化,在几小时内完成。我们证明了CUTS-GPR能够实现高维势能面的贝叶斯建模,这是计算化学中长期存在的挑战。

English abstract

We introduce CUTS-GPR, a new method for performing numerically exact Gaussian process regression (GPR) in high-dimensional settings. The key component of CUTS-GPR is an extremely fast kernel matrix-vector product, which exhibits near-linear or even linear scaling with the amount of training data, $N$, and low-order polynomial scaling with dimensionality, $D$. This is obtained by combining an additive kernel with an incomplete grid and exploiting the resulting structure of the kernel matrix. We demonstrate the scalability of the matrix-vector product by running benchmarks with billions of data points and thousands of dimensions. Full GPR calculations, including hyperparameter optimization, are completed in a matter of hours for $N = 447{,}265$ and $D = 24$. We demonstrate that CUTS-GPR enables Bayesian modeling of high-dimensional potential energy surfaces, a longstanding challenge in computational chemistry.
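For background, the classical Kronecker trick that structured-kernel methods build on can be sketched in a few lines. This is the textbook identity for a product kernel on a complete two-factor grid, not the CUTS-GPR algorithm for additive kernels on incomplete grids, which the abstract does not spell out. All function names are hypothetical.

```python
def matmul(P, Q):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def kron_matvec(A, B, x):
    """Compute (A kron B) @ x without forming the Kronecker product.

    With x reshaped row-major into an m-by-n matrix X, the result is the
    row-major flattening of A @ X @ B^T.
    """
    m, n = len(A), len(B)
    X = [x[i * n:(i + 1) * n] for i in range(m)]
    Y = matmul(matmul(A, X), transpose(B))
    return [v for row in Y for v in row]

def kron_dense(A, B):
    """Explicit Kronecker product, used here only to check the fast version."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

def dense_matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]
```

The fast version never stores the $mn \times mn$ matrix. On a complete grid this is what makes matrix-vector products cheap, and handling missing grid points is the part that needs extra machinery.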

2605.08031 2026-05-11 cs.CV

Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

面向视觉语言模型的无对象幻觉强化反学习

Kaidi Jia, Yujie Lin, Chengyi Yang, Jiayao Ma, Jinsong Su

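Affiliations: Xiamen University

AI summary: This paper proposes HFRU, a framework that performs deep semantic removal in the vision encoder by combining alignment disruption with GRPO-based optimization. It achieves effective unlearning with little object hallucination and performs strongly on object recognition and face identity tasks.

Details
AI Chinese abstract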

视觉语言模型(VLMs)因隐私、版权和偏见问题引发了对机器反学习的需求。然而现有方法仅微调语言解码器,导致表面遗忘无法消除底层视觉表示且常引入对象幻觉。本文提出HFRU,一种基于视觉编码器的强化反学习框架,采用两阶段方法结合对齐破坏与GRPO优化,使用复合奖励包括抽象奖励以鼓励语义有效替代并缓解幻觉。在目标识别和人脸识别任务中,HFRU实现了超过98%的遗忘和保留性能,同时引入极小的对象幻觉,显著优于先前方法。代码和实现细节可在https://github.com/XMUDeepLIT/HFRU获取。

English abstract

Vision-language models (VLMs) raise growing concerns about privacy, copyright, and bias, motivating machine unlearning to remove sensitive knowledge. However, existing methods primarily fine-tune the language decoder, leading to superficial forgetting that fails to erase underlying visual representations and often introduces object hallucination. We propose HFRU, a reinforcement unlearning framework that operates on the vision encoder for deep semantic removal. Our two-stage approach combines alignment disruption with GRPO-based optimization using a composite reward, including an abstraction reward that encourages semantically valid substitutions and mitigates hallucinations. Experiments on object recognition and face identity tasks show that HFRU achieves over 98% forgetting and retention performance, while introducing negligible object hallucination, significantly outperforming prior methods. Our code and implementation details are available at https://github.com/XMUDeepLIT/HFRU.

2605.08030 2026-05-11 cs.CV cs.LG

PET-Adapter: Test-Time Domain Adaptation for Full and Limited-Angle PET Image Reconstruction

PET-Adapter:全角度和有限角度PET图像重建的测试时间域适应

Rüveyda Yilmaz, Yuli Wu, Johannes Stegmaier, Volkmar Schulz

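Affiliations: Institute of Imaging and Computer Vision, RWTH Aachen University, Germany; Machine Learning for Medical Data, Heinrich Heine University Düsseldorf, Germany

AI summary: PET-Adapter is a test-time domain adaptation framework that adapts generative PET reconstruction models pretrained on phantom data to clinical data with different anatomies, tracers, and scanner configurations, without paired ground truth, improving both full-angle and limited-angle reconstruction.

Details
AI Chinese abstract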

正电子发射断层扫描(PET)图像重建固有地受到泊松噪声和物理退化因素的挑战,特别是在有限角度采集中更为严重。尽管深度学习方法表现出有前途的性能,但其在未见临床数据分布上的泛化能力仍受限于广泛的重新训练。我们提出PET-Adapter,一种用于生成式PET重建模型的测试时间域适应框架,该模型仅在phantom数据上预训练。我们的方法使模型能够适应具有不同解剖结构、示踪剂和扫描仪配置的临床数据,而无需配对的真实数据。PET-Adapter在适应过程中引入逐层低秩解剖条件化,并采用有序子集期望最大化(OSEM)为基础的预热启动,从物理引导的重建中初始化生成,将扩散步骤从50减少到2而不影响质量。在多个临床数据集上的实验显示,在全角度和有限角度设置中均表现出优越的3D重建性能,突显了所提出方法的临床可行性和计算效率。

English abstract

Positron Emission Tomography (PET) image reconstruction is inherently challenged by Poisson noise and physical degradation factors, which are further exacerbated in limited-angle acquisitions. While deep learning methods demonstrate promising performance, their generalization to unseen clinical data distributions remains limited without extensive retraining. We propose PET-Adapter, a test-time domain adaptation framework for generative PET reconstruction models pretrained solely on phantom data. Our method enables adaptation to clinical datasets with varying anatomies, tracers, and scanner configurations without requiring paired ground truth. PET-Adapter introduces layer-wise low-rank anatomical conditioning during adaptation and Ordered Subset Expectation Maximization-based warm-starting that initializes the generation from physics-informed reconstructions, reducing diffusion steps from 50 to 2 without compromising quality. Experiments across multiple clinical datasets demonstrate superior 3D reconstruction performance in both full-angle and limited-angle settings, highlighting the clinical feasibility and computational efficiency of the proposed approach.

2605.08029 2026-05-11 cs.CV cs.LG

STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

STARFlow2:连接语言模型与归一化流以实现统一的多模态生成

Ying Shen, Tianrong Chen, Yuan Gao, Yizhe Zhang, Yuyang Wang, Miguel Ángel Bautista, Shuangfei Zhai, Joshua M. Susskind, Jiatao Gu

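Affiliations: Apple

AI summary: This paper presents STARFlow2, which couples normalizing flows with a pretrained vision-language model for unified multimodal generation, using a deep-shallow flow design and a unified FAE latent space to enable cache-friendly interleaved generation.

Comments: 19 pages, 9 figures

Details
AI Chinese abstract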

深度生成模型在文本和视觉领域迅速发展,推动了能够理解、推理和生成交织文本-图像序列的统一多模态系统。现有方法将自回归语言模型与基于扩散的图像生成器结合,继承了因果文本生成与迭代视觉去噪之间的结构不匹配。我们发现自回归归一化流是自回归Transformer,共享相同的因果掩码、KV缓存机制和从左到右的结构,使其成为真正统一多模态生成的最自然范式。我们提出了STARFlow2,基于Pretzel架构,通过残差跳跃连接垂直交织预训练VLM流与TarFlow流,两者均在相同的因果掩码下运行。结合深度浅层流设计和统一FAE潜在空间,STARFlow2实现了缓存友好的交织生成,其中文本和视觉输出可直接进入KV缓存而无需重新编码。实验表明,STARFlow2在图像生成和多模态理解基准上表现出色,验证了自回归流作为统一多模态建模基础的有效性。

English abstract

Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising. We observe that autoregressive normalizing flows are autoregressive Transformers--sharing the same causal mask, KV-cache mechanism, and left-to-right structure as LLMs--making them the most natural paradigm for true unified multimodal generation. We present STARFlow2, built on the Pretzel architecture that vertically interleaves a pretrained VLM stream with a TarFlow stream via residual skip connections, both operating under the same causal mask. Combined with a deep-shallow flow design and a unified FAE latent space, STARFlow2 enables cache-friendly interleaved generation where both text and visual outputs directly enter the KV-cache without re-encoding. Experiments demonstrate strong performance across image generation and multimodal understanding benchmarks, validating autoregressive flows as a viable foundation for unified multimodal modeling.

2605.08028 2026-05-11 cs.LG cs.SY eess.SY

Adaptive Domain Decomposition Physics-Informed Neural Networks for Traffic State Estimation with Sparse Sensor Data

自适应域分解物理信息神经网络用于稀疏传感器数据的交通状态估计

Eunhan Ka, Ludovic Leclercq, Satish V. Ukkusuri

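Affiliations: Lyles School of Civil and Construction Engineering, Purdue University

AI summary: This paper proposes Adaptive Domain Decomposition Physics-Informed Neural Networks (ADD-PINN), a residual-guided framework for LWR-based offline speed-field reconstruction that achieves lower relative L2 error under sparse sensor data and trains faster than the XPINN baseline.

Comments: 56 pages, 5 figures, 12 tables. Submitted to Transportation Research Part C

Details
AI Chinese abstract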

从稀疏固定传感器估计交通状态具有挑战性,因为物理信息神经网络(PINNs)倾向于过度平滑Lighthill-Whitham-Richards(LWR)模型允许的激波。本文提出自适应域分解物理信息神经网络(ADD-PINN),一种基于LWR的离线速度场重建的两阶段残差引导框架。首先训练一个粗粒度全局PINN;然后利用其空间残差剖面确定子域边界并初始化子网络,在分解模式下进行分解,同时数据驱动的激波指示器可以在局部证据较弱时保留单域回退。主要的离线I-24 MOTION评估覆盖五天、五种传感器配置和每种配置十个种子,总计1500次运行。与神经网络和物理信息基线相比,ADD-PINN在25种配置中的18种和15种稀疏传感情况中的14种中取得了最低的相对L2误差,同时训练速度比扩展PINN(XPINN)基线快2.4倍。消融研究支持空间-only分解作为固定传感器交通重建的有效默认设置。补充下一代模拟(NGSIM)实验作为负对照:激波指示器在所有50次运行中抑制分解,而默认的单域回退在所有传感器配置中排名第一。这些结果支持残差引导的空间分解作为PINN家族设计在稀疏固定传感与局部转换区域重合时的有效方法。

English abstract

Traffic state estimation from sparse fixed sensors is challenging because physics-informed neural networks (PINNs) tend to over-smooth the shockwaves admitted by the Lighthill-Whitham-Richards (LWR) model. This study proposes Adaptive Domain Decomposition Physics-Informed Neural Networks (ADD-PINN), a two-stage residual-guided framework for LWR-based offline speed-field reconstruction. A coarse global PINN is first trained; its spatial residual profile is then used to place subdomain boundaries and initialize child subnetworks in a decomposition-enabled mode, while a data-driven shock indicator can retain a single-domain fallback when localized evidence of transition is weak. The primary offline I-24 MOTION evaluation spans five days, five sensor configurations, and ten seeds per configuration, yielding 1,500 runs in total. Against neural and physics-informed baselines, ADD-PINN attains the lowest relative L2 error in 18 of 25 configurations and in 14 of 15 sparse-sensing cases, while training 2.4 times faster than the extended PINN (XPINN) baseline. An ablation study supports spatial-only decomposition as an effective default for fixed-sensor traffic reconstruction in the evaluated settings. Supplementary Next Generation Simulation (NGSIM) experiments serve as a negative control: the shock indicator suppresses decomposition in all 50 runs, and the default single-domain fallback ranks first across all sensor configurations. These results support residual-guided spatial decomposition as an effective PINN-family design for offline reconstruction when sparse fixed sensing coincides with localized transition regions.
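For readers unfamiliar with the LWR model named above, its standard form is a scalar conservation law for traffic density $\rho(x,t)$ with flux $q(\rho) = \rho\, v(\rho)$. The Greenshields speed-density relation below is only an illustrative assumption, since the abstract does not state which fundamental diagram the paper uses. The last line is the pointwise residual a PINN with density output $\rho_\theta$ would penalize under that assumption.

```latex
\begin{aligned}
&\frac{\partial \rho}{\partial t} + \frac{\partial q(\rho)}{\partial x} = 0,
\qquad q(\rho) = \rho\, v(\rho), \\
&v(\rho) = v_f \left( 1 - \frac{\rho}{\rho_{\max}} \right)
\quad \text{(Greenshields, assumed for illustration)}, \\
&r_\theta(x,t) = \frac{\partial \rho_\theta}{\partial t}
 + v_f \left( 1 - \frac{2 \rho_\theta}{\rho_{\max}} \right) \frac{\partial \rho_\theta}{\partial x}.
\end{aligned}
```

Shockwaves are discontinuities in $\rho$ that this equation admits as weak solutions. A smooth network fitted by minimizing $r_\theta$ tends to blur them, which is the over-smoothing the abstract refers to.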

2605.08024 2026-05-11 cs.AI

MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

MPD$^2$-Router:青光眼筛查与诊断中的掩码感知多专家先验正则化双头转交路由器

Wenxin Zhan

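Affiliations: Baruch College

AI summary: This paper proposes MPD$^2$-Router, which couples a dual-head deferral/allocation policy with mask-aware Gumbel-sigmoid gating to route glaucoma screening cases among available experts, improving MCC over AI-only screening and lowering clinical cost.

Details
AI Chinese abstract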

本文提出MPD$^2$-Router,通过结合双头延迟/分配策略和掩码感知的Gumbel-sigmoid门控,解决青光眼筛查中专家资源分配问题,提升诊断准确率并降低临床成本。

English abstract

Learning-to-defer (L2D) can make glaucoma screening safer by routing difficult or uncertain cases to humans, yet standard formulations overlook expert availability, heterogeneous reader behavior, workload imbalance, asymmetric diagnostic harm, case difficulty from morphology, and deployment shift. We introduce MPD$^2$-Router, a mask-aware multi-expert deferral framework that recasts ophthalmic triage as constrained human--AI routing: whether to defer and to which available expert. It couples a dual-head deferral/allocation policy with mask-aware Gumbel--sigmoid gating that strictly enforces per-sample availability, and fuses uncertainty, morphology, image-quality, and OOD signals. Training uses an asymmetric cost-sensitive objective with an augmented-Lagrangian deferral budget, a group-specific distribution prior, and a rank-majorization JS regularizer that jointly prevent expert collapse without forcing uniform allocation. Across three cross-national glaucoma cohorts (REFUGE, CHAKSU, ORIGA) with a frozen REFUGE-trained backbone, MPD$^2$-Router substantially lowers clinical cost and improves MCC over AI-only at a moderate deferral rate. It is Pareto-optimal in F1--MCC--cost, robust under cross-domain shift, and yields balanced expert utilization.
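The gating mechanism mentioned above can be illustrated with a small sketch. It assumes the common binary-concrete form of the Gumbel-sigmoid (logistic noise added to a logit, then a temperature-scaled sigmoid) and enforces availability with a hard mask. How the paper combines this with its deferral head is not stated in the abstract, and all names here are hypothetical.

```python
import math
import random
from typing import List, Optional

def stable_sigmoid(z: float) -> float:
    """Sigmoid that avoids overflow for large negative inputs."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gumbel_sigmoid(logit: float, tau: float, rng: random.Random) -> float:
    """Relaxed Bernoulli sample: the difference of two Gumbel draws is logistic noise."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    noise = math.log(u) - math.log(1.0 - u)
    return stable_sigmoid((logit + noise) / tau)

def masked_gates(logits: List[float], available: List[bool],
                 tau: float = 0.5, seed: int = 0) -> List[float]:
    """Gate value per expert; unavailable experts get exactly zero."""
    rng = random.Random(seed)
    return [gumbel_sigmoid(l, tau, rng) if a else 0.0
            for l, a in zip(logits, available)]

def route(logits: List[float], available: List[bool],
          tau: float = 0.5, seed: int = 0) -> Optional[int]:
    """Pick the available expert with the largest gate, or None if nobody is available."""
    if not any(available):
        return None
    gates = masked_gates(logits, available, tau, seed)
    return max((i for i, a in enumerate(available) if a), key=lambda i: gates[i])
```

Masking after the relaxation, rather than adding a large negative constant to the logit, means an unavailable expert can never be selected regardless of the noise draw. This is one simple reading of strictly enforcing availability.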

2605.08020 2026-05-11 cs.RO

Active Embodiment Identification with Reinforcement Learning for Legged Robots

基于强化学习的腿足机器人主动具身识别

Nico Bohlinger, Jan Peters

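Affiliations: Department of Computer Science, Technical University of Darmstadt; Robotics Institute Germany (RIG); German Research Center for AI (DFKI); hessian.AI

AI summary: This paper proposes an active embodiment identification method that combines information-seeking behavior with explicit embodiment prediction, learning joint-level and global embodiment parameters of robots with different morphologies through interaction in simulation.

Details
AI Chinese abstract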

我们提出了一种用于腿足机器人的主动具身识别方法,该方法联合学习信息寻求行为和显式具身预测。使用历史增强的URMA架构,该方法通过在不同形态中与环境的交互推断关节级和全局具身参数。

English abstract

We present an active embodiment identification method for legged robots that jointly learns information-seeking behavior and explicit embodiment prediction. Using a history-augmented URMA architecture, the method infers joint-level and global embodiment parameters through interaction with the environment in simulation across different morphologies.

2605.08019 2026-05-11 cs.AI q-bio.NC

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

为玩而推理:前沿LRM与人类游戏学习者的行为与大脑对齐

Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield, Joshua B. Tenenbaum, Rui Ponte Costa, Marcelo G. Mattar, Momchil Tomov

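Affiliations: University of Oxford; Columbia University; New York University; Massachusetts Institute of Technology; Harvard University

AI summary: This paper studies how well frontier LRMs align with human behavior and brain activity during game learning, finding that they most closely match human behavior during game discovery and predict brain activity more accurately than reinforcement learning alternatives.

Details
AI Chinese abstract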

人类在遇到新环境时能快速学习抽象知识并灵活应用。现代AI系统能否以类似方式学习和规划?本文利用复杂的人类游戏数据集和同时进行的fMRI记录,研究参与者学习需要规则发现、假设修正和多步骤规划的新视频游戏。通过评估模型在游戏表现、匹配人类学习行为和预测大脑活动的能力,比较了前沿大型推理模型(LRMs)与模型无关和模型基于的深度强化学习代理以及基于贝叶斯理论的代理。研究发现,前沿LRMs在游戏发现阶段最接近人类行为模式,并在皮层和皮层下区域的脑活动预测上比强化学习替代方案好一个数量级,效果在排列控制下仍稳健。通过针对性操纵,进一步显示大脑对齐反映模型对游戏状态的上下文表示,而非其下游规划或推理。研究结果确立了LRMs作为复杂自然环境人类学习和决策的有说服力的计算解释。项目页面含交互式回放:https://botcs.github.io/reason-to-play/

English abstract

Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/

2605.08013 2026-05-11 cs.AI

Learning CLI Agents with Structured Action Credit under Selective Observation

基于选择性观察的结构化行动信用学习CLI代理

Haoyang Su, Ying Wen

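Affiliations: Fudan University; Shanghai Innovation Institute; Shanghai Jiao Tong University

AI summary: This paper studies the selective-observation and action-credit-assignment bottlenecks of CLI agents, proposes σ-Reveal and A³ to address them, and builds the ShellOps dataset suite for evaluating CLI task learning.

Details
AI Chinese abstract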

命令行界面(CLI)代理正在成为一种实用的agent-计算机交互范式,用于演变文件系统、可执行命令行程序和在线执行反馈。最近的工作使用强化学习(RL)从可验证的任务反馈中学习这些交互能力,但很少有方法利用CLI行动的原生结构化属性作为学习信号。除了这一未充分利用的行动结构外,CLI学习还耦合了两种对编码代理而言的瓶颈。首先,代理必须在大规模代码库中从部分观察中识别任务相关证据。其次,稀疏终端奖励必须分配给塑造长多轮轨迹的动作。我们通过shell驱动的信息提取和文件编辑任务研究这些瓶颈。对于选择性观察,我们引入σ-Reveal,一种推理时间机制,为同一CLI选择token预算的上下文。对于信用分配,我们提出行动优势分配(A³),一种原生代理RL方法,保留了标准代理RL的算法复杂性。A³从episode级相对反馈、基于抽象语法树(AST)的动作子链残差和tree级轨迹边际构建turn级优势。为进一步评估此问题设置,我们构建ShellOps,一个覆盖CLI任务的可验证数据集套件。

English abstract

Command line interface (CLI) agents are emerging as a practical paradigm for agent-computer interaction over evolving filesystems, executable command line programs, and online execution feedback. Recent work has used reinforcement learning (RL) to learn these interaction abilities from verifiable task feedback, yet few methods exploit the native structured attributes of CLI actions as learning signals. Beyond this underused action structure, CLI learning also couples two bottlenecks for coding agents. First, the agent must identify task-relevant evidence in a large codebase from partial observations. Second, sparse terminal rewards must be assigned to the actions that shape a long multi-turn trajectory. We study these bottlenecks through shell-driven information extraction and file editing tasks. For selective observation, we introduce $σ$-Reveal, an inference-time mechanism that selects token-budgeted context for the same CLI. For credit assignment, we propose Action Advantage Assignment ($\mathrm{A}^3$), a native agentic RL method that preserves the algorithmic complexity of standard agentic RL. $\mathrm{A}^3$ constructs turn-level advantages from episode-level relative feedback, abstract syntax tree (AST) based action sub-chain residuals, and tree-level trajectory margins. To further evaluate this problem setting, we construct ShellOps, a verifiable dataset suite covering CLI tasks in repository environments.

2605.08011 2026-05-11 cs.AI stat.CO

Abductive Reasoning with Probabilistic Commonsense

基于概率常识的溯因推理

Joseph Cotnareanu, Chiara Roverato, Han Zhou, Didier Chetelat, Yingxue Zhang, Mark Coates

Affiliations: International Laboratory on Learning Systems; McGill University; Mila - Quebec Artificial Intelligence Institute; Huawei Noah's Ark Lab

AI summary: This paper proposes PACS, an algorithm that combines an LLM with a formal solver to model individual variation in commonsense beliefs, improving the abductive reasoning of large language models.

Journal ref: Proceedings of the International Conference on Machine Learning, 2026

Details
AI Chinese abstract

近期改进大语言模型(LLM)推理能力的努力集中在将形式逻辑求解器整合到神经符号框架中。关键挑战在于形式求解器缺乏常识世界知识,无法完成人类认为显而易见的推理步骤。先前方法通过使用LLM提供缺失的常识假设来解决这一问题,但这些方法隐含地假设人们对常识事实有普遍共识。实际上,常识信念在个体间存在差异。我们提出了一种用于溯因常识推理的概率框架,显式建模这种差异,旨在判断大多数人会认为一个陈述为真还是为假。我们引入了Probabilistic Abductive CommonSense(PACS),一种新颖的算法,利用LLM和形式求解器采样证明,将其作为对个体不同常识信念的观测,并在这些样本上汇总结论。实证表明,PACS在多个基准上优于思维链推理、先前的神经符号方法和基于搜索的方法。

English abstract

Recent efforts to improve the reasoning abilities of Large Language Models (LLMs) have focused on integrating formal logic solvers within neurosymbolic frameworks. A key challenge is that formal solvers lack commonsense world knowledge, preventing them from making reasoning steps that humans find obvious. Prior methods address this by using LLMs to supply missing commonsense assumptions, but these approaches implicitly assume universal agreement on such commonsense facts. In reality, commonsense beliefs vary across individuals. We propose a probabilistic framework for abductive commonsense reasoning that explicitly models this variation, aiming to determine whether most people would judge a statement as true or false. We introduce Probabilistic Abductive CommonSense (PACS), a novel algorithm that uses an LLM and a formal solver to sample proofs as observations of individuals' distinct commonsense beliefs, and aggregates conclusions across these samples. Empirically, PACS outperforms chain-of-thought reasoning, prior neurosymbolic methods, and search-based approaches across multiple benchmarks.

2605.08007 2026-05-11 cs.LG

Interpreting Reinforcement Learning Agents with Susceptibilities

用敏感度(susceptibilities)解释强化学习智能体

Chris Elliott, Einar Urdshals, David Quarel, Daniel Murfet

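Affiliations: Timaeus

AI summary: This paper generalizes susceptibilities, a neural network interpretability technique, to the regret setting in deep reinforcement learning and studies them in a simple gridworld model, where they reveal internal features of the model's development in parameter space.

Comments: 55 pages, comments welcome

Details
AI Chinese abstract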

Susceptibilities(敏感度)是一种神经网络可解释性技术,研究可观测量的后验期望值对损失扰动的响应。本文将这一构造推广到深度强化学习中的遗憾(regret)设定,并在一个简单但仍表现出非平凡阶段性发展的网格世界模型中考察敏感度的效用。我们论证敏感度揭示了模型在参数空间中发展的内部特征,而这些特征无法仅通过研究所学策略的发展来检测。我们通过激活引导(activation steering)验证了这些结果,并讨论了该框架向RLHF后训练的扩展。

English abstract

Susceptibilities are a technique for neural network interpretability that studies the response of posterior expectation values of observables to perturbations of the loss. We generalize this construction to the setting of the regret in deep reinforcement learning and investigate the utility of susceptibilities in a simple gridworld model that nevertheless exhibits non-trivial stagewise development. We argue that susceptibilities reveal internal features of the development of the model in parameter space that one cannot detect purely by studying the development of the learned policy. We validate these results with activation-steering, and discuss the framework's extension to RLHF post-training.

2605.08005 2026-05-11 cs.LG

STEPS: A Temporal Smooth Error Propagation Solver on the Manifolds for Test-Time Adaptation in Time Series Forecasting

STEPS:一种用于时间序列预测测试时适应的流形上时间平滑误差传播求解器

Jiaqi Liu, Yifan Ouyang, Zhifei Song, Sim Kuan Goh, Ashwaq Qasem

Affiliations: School of Artificial Intelligence and Robotics, Xiamen University Malaysia; S.M.A.R.T. NEXUS Centre of Excellence, Xiamen University Malaysia

AI summary: STEPS recasts test-time adaptation as a Dirichlet boundary value problem on a temporal manifold and combines a Local Solver, a Global Solver, and Spatiotemporal Manifold Fusion to improve forecasts under sparse or noisy prefixes, with an average relative MSE reduction of 26.82% over the zero-shot backbone.

Comments: 9 pages main text, appendix included. 7 figures. Submitted to NeurIPS 2026

Details
AI Chinese abstract

测试时间适应(TTA)旨在通过推理期间揭示的有限观测来提高时间序列预测的分布偏移鲁棒性。然而,TTA必须在无源在线环境下运行,其中适应信号短、时间相关且可能含噪声。现有方法在揭示前缀稀疏或受污染时可能面临识别性差、误差累积和长周期修正不稳定的问题。为解决这些问题,我们提出了STEPS,一种用于时间序列预测的平滑时间误差传播求解器。STEPS将预测TTA重新表述为时间流形上的Dirichlet边界值问题,其中揭示的前缀误差作为未知未来误差场的边界条件。然后,STEPS在预测空间中求解一个平滑且有界的修正场:局部求解器在时间平滑性下传播前缀误差,全局求解器检索稳定跨窗口误差记忆,时空流形融合(SMF)将两种解决方案整合到最终修正中。在六个标准基准和四个冻结后端上,STEPS在零样本后端上平均相对MSE降低26.82%,超过最强的TTA基线12.77%。额外的稀疏前缀和污染测试确认了STEPS在有限和噪声前缀下的鲁棒性。

English abstract

Test-Time Adaptation (TTA) aims to improve time series forecasting under distribution shifts by using limited observations revealed during inference. However, forecasting TTA must operate in a source-free online setting, where the adaptation signal is short, temporally correlated, and potentially noisy. Existing methods can therefore suffer from weak identifiability, error accumulation, and unstable long-horizon corrections when the revealed prefix is sparse or contaminated. To address these issues, we propose STEPS, a Smooth Temporal Error Propagation Solver for TTA in time-series forecasting. STEPS reformulates forecasting TTA as a Dirichlet Boundary Value Problem on a temporal manifold, where the revealed prefix error serves as the boundary condition for the unknown future error field. Then, STEPS solves a smooth and bounded correction field in prediction space: a Local Solver propagates prefix errors under temporal smoothness, a Global Solver retrieves stable cross-window error memory and Spatiotemporal Manifold Fusion (SMF) integrates both solutions into the final correction. Across six standard benchmarks and four frozen backbones, STEPS achieves an average relative MSE reduction of 26.82% over the zero-shot backbone, exceeding the strongest compared TTA baseline by 12.77%. Additional sparse prefix and contamination tests confirm the robustness of STEPS under limited and noisy prefixes.
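The boundary-value view in the abstract can be made concrete with a one-dimensional toy. The sketch below treats the last observed prefix error as the left boundary value, assumes the error decays to zero just beyond the horizon (an assumption made here, not stated in the abstract), and solves the discrete Laplace equation in between with the Thomas algorithm. It is not the STEPS Local Solver, and the function names are hypothetical.

```python
from typing import List

def solve_dirichlet(e_left: float, e_right: float, horizon: int) -> List[float]:
    """Solve -e[t-1] + 2*e[t] - e[t+1] = 0 for t = 1..horizon with fixed end values.

    This is the smoothest interior field (minimal sum of squared differences)
    consistent with the two boundary values.
    """
    n = horizon
    if n == 0:
        return []
    diag = [2.0] * n
    rhs = [0.0] * n
    rhs[0] += e_left    # known left neighbor moves to the right-hand side
    rhs[-1] += e_right  # known right neighbor moves to the right-hand side
    # Thomas algorithm for a tridiagonal system with off-diagonals equal to -1.
    for i in range(1, n):
        w = -1.0 / diag[i - 1]
        diag[i] -= w * (-1.0)
        rhs[i] -= w * rhs[i - 1]
    e = [0.0] * n
    e[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        e[i] = (rhs[i] + e[i + 1]) / diag[i]
    return e

def correct_forecast(forecast: List[float], last_prefix_error: float) -> List[float]:
    """Add the propagated error field (truth minus prediction) to a frozen forecast."""
    field = solve_dirichlet(last_prefix_error, 0.0, len(forecast))
    return [f + c for f, c in zip(forecast, field)]
```

In this toy the solution is a straight line from the prefix error down to zero, so the correction is strongest near the revealed prefix and fades over the horizon. A bounded, smoothly decaying correction of this kind is the behavior the abstract describes.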