arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05071 2026-06-04 cs.CV

InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space

Jiarui Wu, Yujin Wang, Ruikang Li, Fan Zhang, Mingde Yao, Tianfan Xue

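Affiliations Shanghai AI Laboratory; CUHK MMLab; CPII under InnoHK

AI summary Proposes an image retouching method based on bilateral-space manipulation: the model predicts a low-resolution bilateral grid, slices it with a learned guidance map, and combines diffusion-model distillation with a prompt alignment loss to deliver efficient, high-fidelity, instruction-following retouching.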

Comments Computer Vision and Pattern Recognition (CVPR), 2026

Abstract

Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching shows a superior visual quality, but often struggles with both fidelity issues due to its generative nature and efficiency because of its iterative sampling process. In this work, we propose an efficient and fidelity-preserving retouching method using bilateral space manipulation, which is both compact and content-decoupled. Specifically, instead of directly editing pixels or image latents, our model predicts a low-resolution bilateral grid of affine transforms, which are sliced using a learned guidance map and then applied to the full-resolution image. This approach yields both high fidelity and improved efficiency. To retain strong priors of a pretrained generative model, we distill a multi-step diffusion model into our bilateral grid framework using Variational Score Distillation, complemented by a prompt alignment loss to guide instruction-following behavior. Additionally, we introduce a new benchmark and evaluate our method across multiple dimensions: fidelity, instruction following, and efficiency. Compared to the latest retouch methods, like Gemini-2.5-Flash (Nano-Banana), our method can avoid content drift, significantly improve latency, and generate visually pleasing edits, while maintaining a high level of fidelity. Project page: https://openimaginglab.github.io/InstantRetouch/.
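
The slice-and-apply step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-channel image, a grid that stores one hypothetical (gain, bias) pair per cell instead of a full color affine matrix, and plain trilinear interpolation. The function name `slice_and_apply` and the grid layout are illustrative assumptions.

```python
def slice_and_apply(image, guide, grid):
    """Slice a low-resolution bilateral grid and apply it per pixel.

    image, guide: H x W nested lists of floats in [0, 1].
    grid: GH x GW x GD nested list of (gain, bias) pairs (assumed layout).
    """
    h, w = len(image), len(image[0])
    gh, gw, gd = len(grid), len(grid[0]), len(grid[0][0])

    def lerp_axis(pos, size):
        # Clamp a continuous grid position, then return
        # (lower index, upper index, weight of the upper index).
        pos = min(max(pos, 0.0), size - 1.0)
        lo = int(pos)
        hi = min(lo + 1, size - 1)
        return lo, hi, pos - lo

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Pixel position selects the spatial cell; the guide value
            # selects the depth (intensity-like) cell.
            y0, y1, wy = lerp_axis((y + 0.5) / h * gh - 0.5, gh)
            x0, x1, wx = lerp_axis((x + 0.5) / w * gw - 0.5, gw)
            z0, z1, wz = lerp_axis(guide[y][x] * (gd - 1), gd)
            gain = bias = 0.0
            for yi, a in ((y0, 1.0 - wy), (y1, wy)):
                for xi, b in ((x0, 1.0 - wx), (x1, wx)):
                    for zi, c in ((z0, 1.0 - wz), (z1, wz)):
                        k = a * b * c  # trilinear weight of this grid cell
                        gain += k * grid[yi][xi][zi][0]
                        bias += k * grid[yi][xi][zi][1]
            # The affine transform acts on the full-resolution pixel.
            out[y][x] = gain * image[y][x] + bias
    return out
```

Because the output is an affine function of the input pixel, with coefficients that vary smoothly across the grid, a transform of this kind can change tone and color but has little room to alter geometry or texture, which is the fidelity argument the abstract makes.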

2606.05070 2026-06-04 cs.LG

RIDE: An Open Dataset and Benchmark for Train Delay Prediction

Clément Elliker, Mathis Le Bail, Clément Mantoux, Jesse Read, Sonia Vanier

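Affiliations LIX, École Polytechnique, IP Paris, France; e.SNCF Solutions, France

AI summary To address the lack of standardized datasets and evaluation protocols for train delay prediction, the authors build RIDE, an open dataset covering the entire Belgian railway network, and use it for the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models.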

Comments 58 pages, 41 figures

Abstract

Train delay prediction is an important problem for both passengers and railway operators, yet progress in the field remains difficult to assess due to the lack of standardized datasets, prediction targets, and evaluation protocols. To address this gap, we introduce RIDE, an open dataset and benchmark for train delay prediction built at nationwide scale over the Belgian railway network. RIDE covers 94.5M train events, 3.6M journeys, and 35.7M weather records from 2023 to 2025. It is organized as a layered data pipeline from raw railway and weather sources to two public releases: a reusable intermediate relational dataset and model-ready benchmark datasets. The benchmark standardizes the prediction task and the training and testing data. It also provides a unified evaluation protocol that supports direct comparison across models. Using this framework, we provide the first comprehensive comparative evaluation of non-learning, statistical learning, and deep learning models. We show that learning-based methods clearly outperform non-learning models, with graph neural networks achieving the best mean performance, while the strongest learning-based models remain relatively close to one another. Beyond aggregate mean absolute error (MAE) and root mean squared error (RMSE), the framework also provides breakdowns by prediction horizon and delay change, enabling more detailed analysis of model behavior across forecasting regimes.
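
The aggregate metrics and the per-horizon breakdown mentioned in the abstract can be sketched as follows. The record layout `(horizon_minutes, predicted_delay, actual_delay)` and the function names are assumptions made for illustration; they are not taken from the RIDE benchmark code.

```python
import math
from collections import defaultdict


def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def breakdown_by_horizon(records):
    """Group errors by prediction horizon and report MAE and RMSE per group.

    records: iterable of (horizon_minutes, predicted_delay, actual_delay),
    a hypothetical layout chosen for this sketch.
    """
    buckets = defaultdict(list)
    for horizon, predicted, actual in records:
        buckets[horizon].append(predicted - actual)
    return {
        horizon: {"mae": mae(errs), "rmse": rmse(errs)}
        for horizon, errs in sorted(buckets.items())
    }
```

RMSE penalizes large misses more heavily than MAE, so reporting both per horizon shows whether a model's errors at long horizons are uniformly larger or dominated by a few severe delays.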

2606.05068 2026-06-04 cs.CV

MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Daeyoung Han, Seongmin Hwang, Moongu Jeon

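Affiliations Department of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea; Department of AI Convergence, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea

AI summary Proposes MaCo-GAN, which replaces the conventional adversarial loss with manifold-contrastive adversarial learning and uses a dynamic fake sample synthesizer to generate fake images that preserve low-resolution correspondence, yielding consistent improvements in the perception-distortion trade-off.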

Abstract

Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.

2606.05067 2026-06-04 cs.LG

FLAGG: Flexible Autoregressive Graph Generation

Samuel Cognolato, Alessandro Sperduti, Luciano Serafini

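Affiliations Department of Mathematics, University of Padova; Fondazione Bruno Kessler (FBK); Department of Information Engineering and Computer Science, University of Trento

AI summary Proposes the FLAGG framework, which combines one-shot models with autoregressive sequential generation to handle graph generation across different sizes and topologies, outperforming purely one-shot and purely autoregressive baselines on multiple datasets.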

Comments Accepted for publication at JMLR, currently in press

Abstract

The landscape of deep graph generation spans two extremes: one-shot and sequential models. The former generate nodes and edges jointly, while the latter sample them autoregressively. Each method performs better in different graph domains depending on size and topology, but neither is applicable to all graph categories. For instance, one-shot methods struggle with generating large graphs, while sequential methods underperform on smaller graphs. A possible way to overcome these limitations is to flexibly combine the two methods in a single system. In this work, we propose the FLAGG (Flexible Autoregressive Graph Generation) framework, which sequentially generates portions of graphs with one-shot models. FLAGG can apply any one-shot model to make it autoregressive, allowing flexibility in choosing the sequential policy. This policy is specified through a stochastic node removal process, which an Insertion Model learns to reverse. We evaluate FLAGG with the DiGress one-shot model on several data sets of different graph sizes and domains. We show that the approach outperforms both one-shot and autoregressive baselines in terms of sampling quality.

2606.05058 2026-06-04 cs.CV cs.AI

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian

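Affiliations SenseTime Research and Tetras.AI

AI summary To address the lack of a unified multi-modal benchmark for CAD, proposes the UniCAD benchmark and UniCAD-MLLM, a universal multi-modal large language model that handles point-cloud-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering end to end in a single framework, achieving the best performance on multiple benchmarks.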

Abstract

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2606.05054 2026-06-04 cs.CL

Boosting Self-Consistency with Ranking

Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Salnikov, Alexander Panchenko, Viktor Moskvoretskii

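Affiliations AIRI; Skoltech; EPFL

AI summary Proposes RISC, which recasts answer selection in self-consistency as a ranking problem and uses a lightweight LambdaRank model with five features, achieving a better accuracy-efficiency trade-off than standard self-consistency on multiple datasets.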

Comments 16 pages, 13 figures, accepted at ACL Student Research Workshop 2026

Abstract

Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority voting often fails to recover correct answers that are already present among the samples. We address this limitation with Ranking-Improved Self-Consistency (RISC), which reformulates answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, RISC uses a lightweight LambdaRank model to score candidate answers with five carefully designed features that capture answer frequency, semantic centrality, and reasoning-trace consistency. We evaluate RISC on three datasets under a range of test-time budgets. Across datasets, RISC consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
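
The difference between majority voting and ranking-based selection can be sketched as below. This is an illustration only: a hand-weighted linear scorer stands in for the learned LambdaRank model, and the feature name `centrality`, the weights, and the function names are hypothetical rather than the paper's five features.

```python
from collections import Counter


def majority_vote(answers):
    """Standard self-consistency: return the most frequent sampled answer."""
    return Counter(answers).most_common(1)[0][0]


def rank_select(answers, extra_features, weights):
    """Score each distinct answer from several features and return the best.

    extra_features: {answer: {feature_name: value}} (hypothetical features).
    weights: {feature_name: weight}; a trained ranker would replace this
    linear scorer in a real system.
    """
    counts = Counter(answers)
    total = len(answers)
    best_answer, best_score = None, float("-inf")
    for answer, count in counts.items():
        features = {"frequency": count / total}
        features.update(extra_features.get(answer, {}))
        score = sum(weights.get(name, 0.0) * value
                    for name, value in features.items())
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

With frequency as the only feature the two functions agree; the ranking view matters when another signal outweighs frequency and promotes a minority answer that is already among the samples.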

2606.05046 2026-06-04 cs.LG stat.ML

Graph Cascades: Contagion-Based Mesoscopic Rewiring for Structure-Aware Graph Machine Learning

Meher Chaitanya, My Le, Luana Ruiz

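Affiliations KTH Royal Institute of Technology; Johns Hopkins University

AI summary Proposes Graph Cascades, a mesoscopic rewiring strategy based on contagion diffusion that builds an auxiliary graph to help graph neural networks and graph transformers capture intermediate-scale structure; it improves several backbones on node classification and comes with a theoretical characterization of when rewiring helps.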

Abstract

We introduce Graph Cascades, a mesoscopic rewiring strategy for Graph Neural Networks (GNNs) and Graph Transformers (GTs) that captures intermediate-scale graph structure beyond purely local edges or fully global attention. Using contagion-based diffusion processes, Graph Cascades constructs, in O(|V|+|E|) time, an auxiliary graph where node pairs supported by repeated multi-hop reinforcement are promoted to direct neighbors. We theoretically characterize when reinforcement-based rewiring helps: sufficient conditions under which reinforcement-based edge selection is more label-aligned than direct adjacency, an SBM witness in which two-hop reinforcement is perfectly homophilic, and a formalization of mesoscopic connectivity via graph effective resistance. Empirically, across node-classification benchmarks, Graph Cascades improves multiple GNN and sparse-GT backbones, with the most reliable gains observed on heterophilic and moderate- to high-degree homophilic graphs. The theoretical conditions also identify regimes where mesoscopic rewiring is unlikely to be beneficial -- low-degree regular graphs and graphs with structural bottlenecks -- and these predictions match the observed failures. We additionally observe tight correlations between performance and structural properties in the rewired graphs.

2606.05043 2026-06-04 cs.AI

Strabo: Declarative Specification and Implementation of Agentic Interaction Protocols

Samuel H. Christie, Amit K. Chopra, Munindar P. Singh

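Affiliations North Carolina State University; Lancaster University

AI summary Proposes Strabo, which models the checkout part of UCP as a declarative interaction protocol and implements agents with the Peach programming model, demonstrating the advantages of declarative specifications; the agents interoperate with Google's UCP agents, offering a path for introducing EMAS ideas into practice incrementally.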

Comments Presented in the Engineering Multiagent Systems Workshop co-located with the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

The last few years have witnessed major advances in the modeling and implementation of multiagent systems based on declarative interaction protocols. Our contribution, Strabo, establishes the relevance of these advances to ongoing industry efforts in Agentic AI. Specifically, we consider UCP, the Universal Commerce Protocol, a recent Google-led effort to standardize e-commerce interactions for AI agents. Our exercise is in two parts. One, we model the part of UCP dealing with checkouts as a declarative Langshaw protocol and implement agents using Peach, a programming model for Langshaw. This part of the exercise brings out the advantages of formal, declarative specifications. Two, we show that Peach agents can interoperate with UCP agents implemented by Google, thereby establishing the fidelity of our approach with respect to UCP. Such interoperation enables the incremental introduction of declarative protocols and agents into a conventional setting, indicating a pathway by which EMAS ideas could influence practice without demanding a wholesale update.

2606.05042 2026-06-04 cs.LG cs.CL cs.SC

In-Context Graphical Inference

Zehua Cheng, Wei Dai, Jiahao Sun

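Affiliations Department of Computer Science, University of Oxford; FLock.io

AI summary Proposes an autoregressive graph Transformer (ICG-I) that mimics variable elimination and uses Tensor-Train compression and weighted conformal prediction to achieve scalable, calibrated marginal inference in discrete graphical models, reaching state-of-the-art performance on standard instances and frustrated spin glasses.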

Comments 19 Pages

Abstract

Marginal inference in discrete graphical models forces a choice between exactness and scalability: exact algorithms are intractable for high-treewidth graphs, while iterative approximations (Belief Propagation, variational methods) sacrifice convergence guarantees on frustrated topologies. We argue that this dichotomy stems from a mismatched inductive bias: iterative methods abandon the sequential elimination structure that makes exact inference correct. We introduce In-Context Graphical Inference (ICG-I), an autoregressive Graph Transformer that restores this structure by mimicking Variable Elimination with learned, Tensor-Train-compressed intermediate factors, paired with a Dirichlet output layer and Weighted Conformal Prediction for calibrated, distribution-free coverage guarantees under topological shift. We prove that TT compression errors propagate at most linearly through the autoregressive chain, that the Dirichlet-Multinomial loss is a proper scoring rule, and that WCP maintains coverage with a quantifiable degradation under estimated density ratios. We conducted intensive experiments to evaluate ICG-I and achieved state-of-the-art performance across all benchmarks. ICG-I reduces MAE from 0.041 (best baseline) to 0.020 on standard instances and achieves 0.048 on N=500 frustrated spin glasses where BP diverges entirely.

2606.05035 2026-06-04 cs.CV

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen

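Affiliations CASIA; UCAS; Horizon Robotics; HKUST(GZ)

AI summary Proposes the Anchor3R framework, which treats feed-forward reconstruction as local measurement prediction in the current-frame coordinate system rather than global regression, and combines window-relative pose prediction, loop-closure reinsertion, and motion averaging to perform online 3D reconstruction and pose estimation on long sequences.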

Abstract

Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.

2606.05031 2026-06-04 cs.CV

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

Dewei Zhou, Xinyu Huang, Xun Wang, Ji Xie, Yabo Zhang, Liang Li, Kunchang Li, Zongxin Yang, Yi Yang

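Affiliations Zhejiang University; ByteDance Seed; Harvard University

AI summary Proposes MetaPoint, which represents a continuous 2D coordinate as a single special token and uses the model's inherent positional encoding to achieve pixel-level spatial control without architectural changes.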

Abstract

Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.

2606.05030 2026-06-04 cs.CL cs.SC

Imbuing Large Language Models with Bidirectional Logic for Robust Chain Repair

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

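Affiliations Department of Computer Science, University of Oxford, UK; FLock.io; Institute of Logic and Computation, TU Wien, Austria

AI summary To counter error snowballing in autoregressive chain-of-thought reasoning, proposes the Teleological Reasoning Infilling (TRI) framework, which recasts erroneous reasoning segments as fill-in-the-middle tasks, introduces a prefix-suffix-middle sequence rearrangement, and combines supervised fine-tuning and direct preference optimization driven by a symbolic verifier, so that only the damaged segment is repaired.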

Comments 25 Pages

Journal ref
In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026

Abstract

Autoregressive chain-of-thought (CoT) reasoning in large language models (LLMs) is fundamentally forward-directed: each step conditions only on prior tokens. This unidirectional inductive bias renders even capable models susceptible to error snowballing, wherein a single logical or arithmetic mistake in an early step irreversibly corrupts the entire reasoning chain. We introduce Teleological Reasoning Infilling (TRI), a training framework that endows decoder-only transformers with a native goal-conditioned bridging capability. The key insight is to reframe erroneous reasoning segments as fill-in-the-middle (FIM) tasks: given a verified prefix premise P, a verified downstream milestone S, and the original query Q, the model must synthesise the logical bridge M that connects P to S rigorously and completely. To achieve this with standard causal architectures, we introduce a Prefix-Suffix-Middle (PSM) sequence rearrangement with three non-overlapping sentinel tokens, enabling M to attend to both P and S without any structural modification to the self-attention mechanism. Training proceeds in two stages: (i) Supervised Fine-Tuning (SFT) on symbolically verified (P, S, M) triples extracted from formal mathematics corpora, and (ii) Direct Preference Optimisation (DPO) with a deterministic symbolic verifier (Lean 4 / Python) as the sole reward oracle, eliminating LLM-judge sycophancy. At inference, TRI operates as a surgical repair module within a dual-system loop: a causal draft model generates an initial trace, the verifier pinpoints failures, and TRI infills only the damaged segment, leaving verified sections intact. Comprehensive experiments on three benchmarks demonstrate that TRI achieves state-of-the-art performance across all tasks, while reducing per-problem token expenditure by 31.2%.
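
The Prefix-Suffix-Middle rearrangement can be pictured with a small string-level sketch. The sentinel strings, the placement of the query before the first sentinel, and all function names are assumptions for illustration; the abstract states only that three non-overlapping sentinel tokens are used.

```python
# Hypothetical sentinel strings; a real tokenizer would reserve dedicated
# special tokens for these markers.
PRE, SUF, MID = "<|prefix|>", "<|suffix|>", "<|middle|>"


def build_psm_prompt(query, prefix, suffix):
    """Lay out query, verified prefix P, and verified suffix S so that a
    causal model generating after MID has already read both P and S."""
    return f"{query}\n{PRE}{prefix}{SUF}{suffix}{MID}"


def build_psm_training_example(query, prefix, suffix, middle):
    """Training text: the PSM prompt followed by the target bridge M."""
    return build_psm_prompt(query, prefix, suffix) + middle


def splice_repair(prefix, middle, suffix):
    """Reassemble the repaired chain in its natural reading order."""
    return prefix + middle + suffix
```

Because the suffix is moved ahead of the middle in the token stream, ordinary left-to-right attention suffices: every token of M can attend to P and S without changing the attention mask.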

2606.05029 2026-06-04 cs.LG cs.CL

Validity Threats for Foundation Model Research

Gunnar König, Martin Pawelczyk, Ulrike von Luxburg, Sebastian Bordt

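Affiliations University of Tübingen, Tübingen AI Center; University of Vienna

AI summary Proposes a causal-inference evaluation framework that maps the approximate experimental strategies used in foundation model research (proxy experiments, observational studies, single-run designs) onto trade-offs among four types of validity (statistical, internal, external, construct), exposing and analyzing the hidden validity threats that come with compute savings.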

Abstract

Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.

2606.05025 2026-06-04 cs.LG cs.AI

Invariant Gradient Alignment for Robust Reasoning Distillation

Zehua Cheng, Wei Dai, Jiahao Sun

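Affiliations University of Oxford; FLock.io

AI summary Proposes the Invariant Gradient Alignment (IGA) framework, which uses Logical Isomer Sets, a continuous gradient conflict mask, and a truncated SVD projection to align gradient updates across examples from different semantic domains that share the same logical structure, improving the robustness of large language models on out-of-distribution inputs.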

Comments 30 Pages

Journal ref
In Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2026

Abstract

Large language models (LLMs) suffer from shortcut learning: they systematically fail on out-of-distribution (OOD) inputs whose semantic surface differs from training data, even when the logical structure is identical. This undermines knowledge distillation pipelines that transfer chain-of-thought reasoning to smaller students. We introduce Invariant Gradient Alignment (IGA), a training framework that aligns gradient updates across semantically diverse but logically isomorphic examples via three innovations: (i) Logical Isomer Sets, groups of problems sharing identical logical structure across distinct semantic domains (mathematics, medicine, law, science); (ii) a differentiable Continuous Gradient Conflict Mask that suppresses parameter dimensions with high cross-domain gradient variance while preserving invariant directions; and (iii) a truncated SVD projection of the masked gradient back onto the LoRA low-rank manifold, maintaining parameter efficiency throughout. Theoretically, IGA yields tighter OOD generalization bounds than ERM, scaling with the number of isomer domains, and converges at the standard SGD rate under mild regularity. Empirically, IGA outperforms eight baselines across four benchmarks with accuracy gains up to 14.3 pp over ERM-SFT and a Logical Consistency Score of 0.031 versus 0.142 -- a fourfold improvement in representational invariance.
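
One way to picture a continuous mask that suppresses dimensions with high cross-domain gradient variance is sketched below. The sigmoid form, the `threshold` and `temperature` parameters, and the function names are assumptions; the abstract does not give the mask's formula, and the SVD projection step is omitted here.

```python
import math


def conflict_mask(domain_grads, threshold=0.1, temperature=0.05):
    """Return a soft mask in (0, 1) per parameter dimension.

    domain_grads: one gradient vector (list of floats) per semantic domain.
    Dimensions whose gradients vary strongly across domains get a mask near
    0; dimensions that agree across domains get a mask near 1.
    The sigmoid shape and both hyperparameters are illustrative assumptions.
    """
    n = len(domain_grads)
    mask = []
    for d in range(len(domain_grads[0])):
        values = [grad[d] for grad in domain_grads]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        # Clamp the exponent so math.exp cannot overflow.
        z = min(max((variance - threshold) / temperature, -60.0), 60.0)
        mask.append(1.0 / (1.0 + math.exp(z)))
    return mask


def masked_mean_gradient(domain_grads, **mask_kwargs):
    """Average the domain gradients, then damp the conflicting dimensions."""
    n = len(domain_grads)
    mask = conflict_mask(domain_grads, **mask_kwargs)
    return [
        m * sum(grad[d] for grad in domain_grads) / n
        for d, m in enumerate(mask)
    ]
```

A smooth mask, unlike a hard threshold, keeps the update differentiable with respect to the gradients, which is the property the abstract highlights.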

2606.05021 2026-06-04 cs.LG

Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan

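Affiliations Department of Mathematics, University of California Los Angeles, Los Angeles, CA, USA

AI summary For multi-agent deep reinforcement learning, proposes an action inference mechanism and a geometric-distribution-based importance sampling strategy to improve MADDPG, improving learning stability, inter-agent cooperation, and exploration efficiency on a discrete-action predator-prey task.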

Abstract

We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, using a geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non-stationarity inherent in multi-agent environments. We evaluate both modifications on the discrete-action Predator-Prey task provided by the PettingZoo library, a flexible Python interface for general multi-agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter-agent cooperation and that importance sampling using a geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code is available at https://github.com/shaashwathsivakumar/MARL_Proj.
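
A recency-biased draw from a replay buffer using a geometric distribution might look like the sketch below. It is not taken from the linked repository: the function name, the success probability `p`, and the rejection of draws older than the buffer are all assumptions, and the sketch shows only the sampling step, not any importance-weight correction.

```python
import math
import random


def sample_recent_indices(buffer_len, batch_size, p=0.01, rng=random):
    """Draw buffer indices with geometrically decaying preference for age.

    Index buffer_len - 1 is the newest transition. The age k of a sampled
    transition follows P(k) = (1 - p)**k * p, truncated to the buffer by
    rejecting ages that fall off its end. `p` is a hypothetical
    hyperparameter.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if buffer_len <= 0:
        raise ValueError("buffer must not be empty")
    log_q = math.log(1.0 - p)
    indices = []
    while len(indices) < batch_size:
        u = rng.random()  # uniform in [0, 1)
        # Inverse-CDF sampling of a geometric variable on {0, 1, 2, ...}.
        age = int(math.log(1.0 - u) / log_q)
        if age < buffer_len:
            indices.append(buffer_len - 1 - age)
    return indices
```

A larger `p` concentrates the batch on the newest transitions, which reflects the other agents' current policies more closely, while a smaller `p` approaches uniform sampling over a long buffer.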

2606.05018 2026-06-04 cs.CV

Handwriting Extraction and Analysis of Signature Lists in Swiss Popular Initiatives

瑞士民众倡议中签名列表的手写提取与分析

Marco Peer, Thomas Gorges, Mathias Seuret, Vincent Christlein, Andreas Fischer

发表机构 * AIBEX Group, University of Fribourg(弗里堡大学AIBEX研究组) Pattern Recognition Lab, FAU Erlangen-Nürnberg(埃尔兰根-纽伦堡大学模式识别实验室)

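AI summary To ease the labor-intensive manual validation of signature lists in Swiss popular initiatives, proposes an automated pipeline combining template-based line segmentation, OCR, and AI-based handwriting analysis (in particular writer retrieval). Experiments show that OCR performs poorly on short text (CER 29.6%), while writer retrieval reaches an mAP of 50.6% and can effectively support duplicate-submission detection.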

Comments Accepted for presentation at ICCST 2026

AI中文摘要

民众倡议和公投是瑞士民主的核心,然而手写签名列表的验证仍然是一个劳动密集型的手工过程。本文研究了自动化文档分析方法的潜力,包括OCR和基于AI的手写分析,以支持这一任务。我们提出了一种结合基于模板的行分割与文本识别和作者检索技术的流水线,并在包含418位作者的443条手写条目的数据集上进行了评估。结果表明,OCR在处理词汇表外的手写文本时表现不佳,名字的字符错误率(CER)为29.6%。相比之下,作者检索表现更为稳健,平均精度均值(mAP)达到50.6%。此外,我们的实验表明,现成的OCR系统对于手写签名数据的转录不够可靠,尤其是对于姓名或地址等短且词汇表外的条目。然而,作者检索方法可以有效地识别签名列表中视觉相似的条目,使其成为基于手写相似性支持检测潜在重复提交的合适工具。

英文摘要

Popular initiatives and referendums are central to Swiss democracy, yet the validation of handwritten signature lists remains a labor-intensive manual process. This paper investigates the potential of automated document analysis methods, including OCR and AI-based handwriting analysis, to support this task. We propose a pipeline combining template-based line segmentation with text recognition and writer retrieval techniques, evaluated on a dataset of 443 handwritten entries from 418 writers. Results show that OCR struggles with out-of-vocabulary handwriting, with a CER of 29.6% for first names. In contrast, writer retrieval performs more robustly, reaching an mAP of 50.6%. Furthermore, our experiments indicate that off-the-shelf OCR systems are not sufficiently reliable for transcription of handwritten signature data, particularly for short, out-of-vocabulary entries such as names or addresses. However, writer retrieval methods can effectively identify visually similar entries across signature lists, making them a suitable tool for supporting the detection of potential duplicate submissions based on handwriting similarity.
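For readers unfamiliar with the CER figure quoted above, the character error rate is the character-level edit distance between the OCR output and the reference transcription, divided by the number of reference characters. The sketch below is a generic illustration, not the paper's evaluation code: pooling edits over all entries (rather than averaging per entry) is an assumption, and the example names in the usage note are made up.

```python
def levenshtein(ref, hyp):
    """Minimum number of character insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """CER = total edit distance / total number of reference characters."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars
```

For example, `character_error_rate(["Marco", "Anna"], ["Marko", "Anna"])` counts one substitution over nine reference characters. Because names are short, a single misread character already moves the rate noticeably, which is consistent with the abstract's remark about short, out-of-vocabulary entries.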

2606.05016 2026-06-04 cs.CL

TaDA: Calibrated Probe Gating for Task-Domain LoRA Merging

TaDA: 任务-领域LoRA合并的校准探针门控

Huy Quoc To, Fuyi Li, Guangyan Huang, Ming Liu

发表机构 * Deakin University(迪肯大学) Adelaide University(阿德莱德大学)

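AI summary Addressing the depth-dependent asymmetry in merging task and domain LoRA adapters, proposes TaDA, a training-free algorithm that uses calibrated probe-guided per-layer gating and per-component subspace-aware merging, achieving the best average accuracy on six scientific QA and six image classification benchmarks.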

AI中文摘要

将任务LoRA适配器与领域LoRA适配器组合成一个统一模型是一个实际但很大程度上未被探索的挑战。现有方法将两个适配器视为对称对等体,对所有层应用统一权重。我们认为,任务和领域适配器在Transformer架构中表现出一致的深度依赖不对称性。领域主导性随层深度增加而增强,而较浅层保留更强的任务相关信号。受此观察启发,我们提出$\textbf{TaDA}$($\textbf{Ta}$sk-$\textbf{D}$omain LoR$\textbf{A}$ Merging),一种无训练算法,通过校准探针引导的逐层门控和逐分量子空间感知合并来利用这种结构。门控使用被证明对适配器权重幅度不变的探针信号,为每层和投影类型分配独立权重。合并则在组合剩余分量之前丢弃冲突的奇异方向。$\textbf{TaDA}$产生一个标准秩$r$的LoRA适配器,推理开销为零。在Llama-2-7B的六个科学QA基准上,TaDA平均准确率达到0.452,比DARE-TIES高出3.6个百分点,并在所有六个基准上取得最佳结果。在ViT-L/16的六个图像分类基准上,TaDA平均准确率达到85.9%,在六个基准中的三个上领先,同时优于最强的合并基线。

英文摘要

Combining a task LoRA adapter with a domain LoRA adapter into a single unified model is a practical yet largely unexplored challenge. Existing methods treat both adapters as symmetric peers, applying uniform weights across all layers. We argue that task and domain adapters exhibit a consistent depth-dependent asymmetry across transformer architectures. Domain dominance increases with layer depth, while shallower layers retain stronger task-relevant signals. Motivated by this observation, we propose $\textbf{TaDA}$ ($\textbf{Ta}$sk-$\textbf{D}$omain LoR$\textbf{A}$ Merging), a training-free algorithm that exploits this structure through calibrated probe-guided per-layer gating and per-component subspace-aware merging. The gating assigns individual weights per layer and projection type using a probe signal that is provably invariant to adapter weight magnitude. The merging discards conflicting singular directions before combining the remaining components. $\textbf{TaDA}$ produces a standard rank-$r$ LoRA adapter with zero inference overhead. On six scientific QA benchmarks with Llama-2-7B, TaDA achieves an average accuracy of 0.452, outperforming DARE-TIES by 3.6 percentage points and obtaining the best result on all six benchmarks. On six image classification benchmarks with ViT-L/16, TaDA reaches 85.9% average accuracy, improving over the strongest merging baseline while leading in three of the six individual benchmarks.

2606.05015 2026-06-04 cs.RO

Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation

环境变异性下基于视觉的四旋翼导航的世界模型泛化

Luca Zanatta, Grzegorz Malczyk, Kostas Alexis

发表机构 * Norwegian University of Science and Technology(挪威科学与技术大学)

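AI summary Using vision-based quadrotor navigation as a testbed, studies the robustness of world models under different levels of environmental randomness, finding that generalization during self-supervised pretraining is a strong predictor of sim-to-real transfer and identifying discrete latent size and training-sequence length as the key factors.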

AI中文摘要

世界模型,即学习预测环境演化的生成模型,已成为样本高效机器人学习的有前景工具。然而,它们对环境变异性的鲁棒性仍知之甚少。为解决这一问题,我们以基于视觉的四旋翼导航为测试平台进行系统研究,在不同环境随机性水平下训练基于DreamerV3的世界模型,并通过跨环境验证(涵盖自监督学习预训练和强化学习微调)在所有水平上评估它们。然后,我们将所有世界模型及相关导航策略部署到真实四旋翼上,在未见环境中进行测试,包括一次开环运行,其中模型仅接收2.5秒的真实感官输入,之后所有传感器被切断,系统完全依靠想象导航穿越12米距离。结果表明,自监督预训练阶段的世界模型鲁棒性是模拟到现实迁移的强预测因子:在跨环境自监督验证中泛化良好的每个模型都成功部署到真实世界,通过窄至0.67米的间隙,而在模拟策略评估中占主导地位的模型却在真实平台上失败。我们进一步识别出(a)离散潜在大小和(b)训练序列长度是控制世界模型质量的主要因素。

英文摘要

World models, learned generative models that predict how an environment evolves, have become a promising tool for sample-efficient robot learning. Yet how robust they are to environmental variability remains poorly understood. To address this, we conduct a systematic study using vision-based quadrotor navigation as a testbed problem, training DreamerV3-based world models under varying levels of environmental randomness and evaluating them across all levels through cross-environment validation, spanning both Self-Supervised Learning (SSL) pretraining and Reinforcement Learning (RL) fine-tuning. We then deploy all world models and associated navigation policies on a real quadrotor in unseen environments, including an open-loop run where the model receives just 2.5 s of real sensory input before all sensors are cut off, leaving the system to navigate entirely in imagination over a 12 m traverse. Our results show that world model robustness during SSL pretraining is a strong predictor of sim-to-real transfer: every model that generalized well in cross-environment SSL validation deployed successfully in the real world, passing through gaps as narrow as 0.67 m, whereas the model that dominated simulation policy evaluation failed on the real platform. We further identify (a) the discrete latent size and (b) the training-sequence length as the dominant factors governing world model quality.

2606.05011 2026-06-04 cs.CV cs.RO

CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation

CIPER: 跨视图图像检索与姿态估计的统一框架

Yurim Jeon, Dongseong Seo, Seung-Woo Seo

发表机构 * Seoul National University(首尔国立大学)

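AI summary Proposes CIPER, a framework that jointly performs city-scale cross-view retrieval and precise 3-DoF pose estimation through a shared transformer encoder and task-specific tokens, enabling mutually beneficial feature learning.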

Comments 16 pages, 5 figures

AI中文摘要

跨视图地理定位通过将地面图像与航拍图像数据库匹配来估计其地理位置。现有方法要么通过大规模检索,要么通过精确姿态估计来处理,但无法兼顾:基于检索的方法能够进行广域搜索,但牺牲了定位精度;而姿态估计方法仅在狭窄的搜索空间内实现高精度。简单级联这些流程会导致误差传播和特征表示不一致。我们将跨视图地理定位形式化为一个统一问题,要求同时进行城市级检索和精确的3自由度姿态估计。我们提出CIPER(跨视图图像检索与姿态估计变换器),这是一种单一架构,通过互惠特征学习联合执行两项任务。CIPER使用共享的Transformer编码器和任务特定令牌,将全局检索特征与空间定位线索分离。为了弥合地面和航拍视图之间的大领域差距,我们引入了一个双向Transformer姿态解码器,该解码器使用地面特征作为空间查询进行双向交叉注意力。一种集合预测策略进一步在统一的多任务目标下实现稳定的3自由度回归。在VIGOR、KITTI和Ford Multi-AV上的实验表明,特别是在有限的视野和任意方向条件下,性能具有竞争力。代码可在https://github.com/yurimjeon1892/CIPER获取。

英文摘要

Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of localization accuracy, while pose estimation methods achieve high precision within only a narrow search space. Naively cascading these pipelines introduces error propagation and inconsistent feature representations. We formulate cross-view geo-localization as a unified problem requiring simultaneous city-scale retrieval and precise 3-DoF pose estimation. We propose CIPER (Cross-view Image-retrieval and Pose-estimation transformER), a single architecture that jointly performs both tasks through mutually beneficial feature learning. CIPER uses a shared transformer encoder with task-specific tokens to disentangle global retrieval features from spatial localization cues. To bridge the large domain gap between ground and aerial views, we introduce a two-way transformer pose decoder that uses ground features as spatial queries for bidirectional cross-attention. A set prediction strategy further enables stable 3-DoF regression under a unified multi-task objective. Experiments on VIGOR, KITTI, and Ford Multi-AV demonstrate competitive performance, especially under limited field-of-view and arbitrary orientation conditions. Code is available at https://github.com/yurimjeon1892/CIPER.

2606.05009 2026-06-04 cs.CL cs.AI

DAR: Deontic Reasoning with Agentic Harnesses

DAR: 基于智能体框架的道义推理

Guangyao Dou, William Jurayj, Nils Holzenberger, Benjamin Van Durme

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Télécom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院)

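AI summary Proposes DAR, a setup that improves LLM-based deontic reasoning by letting the model interact with statutes on demand; experiments show that agentic harnesses can improve performance, but the gains are not uniform and weaker models degrade on numerical tasks.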

AI中文摘要

道义推理是通过将明确的规则和政策应用于具体案例事实来回答问题,例如根据法规计算纳税义务或确定移民上诉结果。基于LLM的道义推理的一个关键技术挑战是相关规则集可能很长且相互引用,因此模型可能仍无法找到特定推理步骤所需的规则。我们引入了道义智能体推理(DAR),这是一种智能体推理设置,其中模型按需与法规交互。我们在DeonticBench的困难子集上使用多种框架评估DAR。在这些设置中,我们发现智能体框架可以推动道义推理任务的前沿,但改进并不均匀:较弱的模型在数值任务上往往性能下降,同时消耗更多的令牌。

英文摘要

Deontic reasoning is the task of answering questions by applying explicit rules and policies to case-specific facts, for example computing tax liability under a statute or determining the outcome of an immigration appeal. A key technical challenge for LLM-based deontic reasoning is that the relevant ruleset can be long and cross-referenced, so models may still fail to locate the rules needed for a particular reasoning step. We introduce Deontic Agentic Reasoning (DAR), an agentic reasoning setup in which the model interacts with the statutes on demand. We evaluate DAR under multiple harnesses on hard subsets of DeonticBench. Across these settings, we find that agentic harnesses can push the frontier on deontic reasoning tasks, but improvements are not uniform: weaker models often degrade on numerical tasks while consuming far more tokens.

2606.05008 2026-06-04 cs.CV cs.AI cs.CL

M$^3$Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

M$^3$Eval: 通过认知基础视频任务的多模态记忆评估

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

发表机构 * School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院) State Key Laboratory of General Artificial Intelligence, Peking University(北京大学通用人工智能国家重点实验室) Yuanpei College, Peking University(北京大学元培学院) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) School of Psychological and Cognitive Sciences, Peking University(北京大学心理学与认知科学学院) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

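AI summary Proposes M$^3$Eval, the first memory evaluation framework for multi-modal models, which uses video tasks grounded in cognitive psychology to systematically assess memory retention, faithfulness, and robustness, revealing notable weaknesses in parallel video stream processing, interference patterns, spatio-temporal memory, and symbolic memory.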

Comments We present an evaluation designed for multi-modal memory in multi-modal models

AI中文摘要

随着多模态模型向长视频理解发展,记忆成为关键能力。尽管在视频数据集和基准测试方面做出了大量努力,现有工作主要关注感知和推理,而没有系统评估记忆:模型保留了什么、信息如何忠实保存、以及记忆在干扰下的鲁棒性。为填补这一空白,我们引入了M$^3$Eval,这是第一个用于探测多模态模型中不同记忆维度的综合评估框架和基准。基于认知心理学,我们的设计通过精心构建的任务来隔离记忆的关键方面。利用M$^3$Eval,我们在代表性多模态模型上进行了大量实验,揭示了一致的弱点和独特行为。我们发现,模型在处理并行视频流时难以保持解耦表示,表现出与人类记忆显著不同的干扰模式,在空间域比时间域更可靠地定位记忆源,并且符号记忆有限。总的来说,我们的基准为未来研究提供了宝贵资源,而我们的发现强调了记忆作为基本但未充分探索的能力,并为设计更有效的多模态模型记忆机制提供了见解。我们的代码和数据集可在https://pku-value-lab.github.io/m3eval-homepage获取。

英文摘要

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M$^3$Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M$^3$Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

2606.05002 2026-06-04 cs.CL

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

GARL:面向多智能体战略优先级排序的博弈论强化学习

Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu

发表机构 * Tsinghua University(清华大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

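AI summary Proposes GARL, a framework that formalises multi-agent strategic prioritisation as a two-stage game and converts game-theoretic utilities into role-specific reinforcement signals to optimise interaction policies, improving performance on issues-in-dispute ranking and making small open-source LLMs competitive with a strong closed-source LLM.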

AI中文摘要

基于LLM的多智能体系统越来越多地用于战略决策任务。在此类设置中,性能不仅取决于单个模型的能力,还取决于智能体交互和适应的策略。多智能体强化学习可以优化这些交互策略,但其奖励设计通常特定于任务且与交互结构的关联较弱。为弥补这一差距,我们提出GARL,一种面向多智能体战略优先级排序的博弈论强化学习框架。GARL将战略优先级排序形式化为两阶段博弈:竞争智能体首先在共享候选集上分配战略资源,然后更高级别的仲裁者产生最终排名。由此产生的博弈论效用被转化为角色特定的强化信号,使策略优化能够由结构化交互引导。我们在争议问题排序任务上实例化GARL,其目标是在法律程序中优先处理核心问题。实验表明,GARL提高了排序性能,使小型开源LLM在相同候选排名设置下与强大的闭源LLM竞争,并在法律领域能力和更广泛的战略决策方面取得收益。总体而言,GARL展示了如何将博弈论交互结构转化为强化学习目标,为多智能体战略优先级排序中的策略优化提供了原则性方法。

英文摘要

LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.

2606.04994 2026-06-04 cs.LG q-bio.QM

New Benchmarking Shows Limited Generalization Power of TCR Antigenic Epitope Prediction Models

新基准测试显示TCR抗原表位预测模型的泛化能力有限

Yiming Liao, Yiheng Li, Ning Jiang, Bo Li, Keke Chen

发表机构 * Trustworthy and Intelligent Computing Lab (TAIC), Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County(可信智能计算实验室(TAIC),计算机科学与电气工程系,马里兰大学巴尔的摩分校) Children’s Hospital of Philadelphia(费城儿童医院) Department of Bioengineering, University of Pennsylvania(生物工程系,宾夕法尼亚大学) Institute for Immunology & Immune Health, University of Pennsylvania(免疫学与免疫健康研究所,宾夕法尼亚大学) Institute for RNA Innovation, University of Pennsylvania(RNA创新研究所,宾夕法尼亚大学) Abramson Cancer Center, University of Pennsylvania(Abramson癌症中心,宾夕法尼亚大学) Center for Precision Engineering for Health, University of Pennsylvania(健康精准工程中心,宾夕法尼亚大学) Center for Cellular Immunotherapies, University of Pennsylvania(细胞免疫治疗中心,宾夕法尼亚大学)

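AI summary Describes two classes of rigorously defined, unseen benchmark datasets for evaluating T cell receptor (TCR) antigen specificity prediction models, notes that existing models generalize poorly, and argues that these datasets provide a framework for model assessment and a foundation for next-generation prediction algorithms.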

Comments 6 pages, 1 figure. Preprint version

AI中文摘要

准确计算预测T细胞受体(TCR)抗原特异性将改变T细胞生物学研究,并实现可扩展的免疫工程,但现有模型缺乏足够的灵敏度和特异性,难以广泛应用。一个主要限制是缺乏严格定义的、未见过的基准数据集,无法对模型性能和泛化能力进行无偏评估。在此,我们描述了两类满足此标准的互补数据集,并认为它们既为模型评估提供了稳健框架,也为下一代TCR-抗原预测算法的开发奠定了基础。

英文摘要

Accurate computational prediction of T cell receptor (TCR) antigen specificity would transform the study of T cell biology and enable scalable immune engineering, yet existing models lack sufficient sensitivity and specificity for broad applications. A major limitation is the absence of rigorously defined, unseen benchmark datasets that allow unbiased evaluation of model performance and generalizability. Here, we describe two complementary classes of datasets that meet this criterion and argue that they provide both a robust framework for model assessment and a foundation for next-generation TCR-antigen prediction algorithm development.

2606.04992 2026-06-04 cs.CV cs.HC

Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: Investigating Workload and Efficiency

用于手术器械操作与组装的多摄像头AR引导系统:工作量与效率研究

Shiyu Li, Julian Kreimeier, Hannah Schieber, Dirk Müller, Bernhard Kainz, Rüdiger von Eisenhart-Rothe, Daniel Roth

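AI summary Proposes a marker-free multi-camera augmented reality guidance system that combines 6D pose estimation with a head-mounted display, significantly reducing perceived workload and shortening task completion time in surgical instrument handling.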

Comments 11 pages

AI中文摘要

手术中器械的操作和组装对洗手护士提出了很高的认知要求,尤其是在器械不熟悉的情况下。我们提出了一种支持性的手术器械引导系统,该系统结合了多摄像头6D位姿估计和头戴显示器上的增强现实原位可视化,无需额外标记。位姿估计和连续的相机校准通过已知物体实现。6D位姿估计网络仅使用合成数据进行训练,旨在获得更好的泛化能力和实际应用性。AR引导显示工具提示定位线索和逐步组装动画。通过基于注视的选择和脚踏板,用户可以在术中操作中切换组装步骤。在技术评估中,我们的方法优于最先进的6D位姿估计。在膝关节置换术的手术模拟中,对29名洗手护士进行了用户研究,将系统与纸质手册进行了比较。AR引导显著降低了感知工作量。客观上,AR引导将任务完成时间减少了21.3%(4.76分钟)。特别是,对器械组不太熟悉的洗手护士在使用该系统时受益。两种条件下的错误频率相当。定性反馈强调了过程清晰度提高、信息过载减少和感知独立性。总之,我们的无标记多摄像头AR引导方法可以在主观和客观上改善术中器械操作性能,特别是对于未经培训的洗手护士。

英文摘要

The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visualization on a head-mounted display without the requirement for additional markers. Pose estimation and consecutive camera calibration are achieved through known objects. The 6D pose estimation network is trained purely on synthetic data, aiming for better generalizability and real-world applicability. The AR guidance displays tooltip localization cues and step-wise assembly animations. Via gaze-based selection and a foot pedal, users can switch between assembly steps in intraoperative use. In a technical evaluation, our approach outperforms state-of-the-art 6D pose estimation. A user study with 29 scrub nurses was conducted in a surgical simulation of knee arthroplasty, comparing the system against a paper manual. AR guidance significantly reduced the perceived workload compared with the paper manual. Objectively, AR guidance reduced task completion time by 21.3% (4.76 minutes). Specifically, scrub nurses less experienced with the instrument set benefited when using the system. Error frequencies were comparable between conditions. Qualitative feedback highlighted improved process clarity, reduced information overload, and perceived independence. To summarize, our marker-free multi-camera AR guidance approach for surgical instruments can, subjectively and objectively, improve intraoperative instrumentation performance, particularly for untrained scrub nurses.

2606.04987 2026-06-04 cs.CL cs.AI cs.HC

DeliChess: A Multi-party Dialogue Dataset for Deliberation in Chess Puzzle Solving

DeliChess: 一个用于国际象棋谜题求解中深思熟虑的多方对话数据集

Xiaochen Zhu, Georgi Karadzhov, Tom Stafford, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) University of Sheffield(谢菲尔德大学)

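AI summary Introduces DeliChess, a dataset of multi-party dialogues in which groups collaboratively solve chess puzzles; shows that discussion significantly improves group accuracy and analyses the role of probing utterances.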

AI中文摘要

多方对话是研究协作推理和决策的关键场景,然而现有数据集很少关注结构化、深入的复杂推理任务。我们引入了DeliChess,一个新颖的群体深思熟虑对话数据集,其中参与者协作解决多项选择国际象棋谜题。每个小组首先单独完成谜题,然后进行多方讨论,最后提交修正后的集体答案。该数据集包含107个对话,附有完整转录、讨论前后的选择以及关于谜题难度和走棋质量的元数据。我们使用基于象棋引擎评估的三个指标评估性能,发现深思熟虑显著提高了群体准确性。我们进一步利用先前深思熟虑数据训练的分类器分析了探询性话语(即引发提议、理由或战略反思的消息)的作用。虽然探询性话语使讨论后的群体表现更加多变,但它并未持续带来更好的性能。我们的数据集为在一个明确定义的策略领域中建模群体推理、对话动态以及不同观点和意见的解决提供了丰富的测试平台。

英文摘要

Multi-party dialogue is a critical setting for studying collaborative reasoning and decision-making, yet existing datasets rarely focus on structured, in-depth complex reasoning tasks. We introduce DeliChess, a novel dataset of group deliberation dialogues in which participants collaboratively solve multiple-choice chess puzzles. The members of each group first complete the puzzle individually, then engage in a multi-party discussion before submitting a revised collective answer. The dataset includes 107 dialogues with full transcripts, pre- and post-discussion choices, and metadata on puzzle difficulty and move quality. We evaluate performance using three metrics based on chess engine evaluations, and find that deliberation significantly improves group accuracy. We further analyse the role of probing utterances (i.e., messages that elicit proposals, justifications, or strategic reflection) using a classifier trained on prior deliberation data. While probing makes group performance more variable after discussion, it does not consistently lead to better performance. Our dataset offers a rich testbed for modelling group reasoning, dialogue dynamics, and the resolution of differing perspectives and opinions in a well-defined strategic domain.

2606.04986 2026-06-04 cs.CV

Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning

Food-R1: 一种基于强化学习的统一多任务食品视觉语言模型

Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang

发表机构 * Huazhong University of Science and Technology(华中科技大学)

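AI summary To address the limited reasoning and generalization of existing food vision-language models that rely on supervised fine-tuning, and the scarcity of nutritional annotations, proposes CalorieBench-80K, a large-scale benchmark with chain-of-thought annotations, and Food-R1, a unified multi-task food vision-language model trained with reinforcement fine-tuning (GRPO), which consistently outperforms strong baselines on food-related tasks.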

AI中文摘要

最近的研究探索了将视觉语言模型(VLM)用于食品分析。然而,现有方法主要依赖于监督微调(SFT),这通常限制了推理和泛化能力。此外,高质量的大规模营养标注仍然稀缺。为了解决这些问题,我们引入了CalorieBench-80K,一个包含精心整理的卡路里标签和饮食建议注释的大规模基准。据我们所知,这是第一个包含链式思维(CoT)注释用于卡路里推理的食品图像基准。我们还提出了Food-R1,一个在多任务学习范式中训练的统一食品VLM,以赋予模型广泛的能力。Food-R1经过基于CoT的冷启动指令微调,然后使用组相对策略优化(GRPO)进行强化微调(RFT),以提高推理和性能。在CalorieBench-80K和代表性基准上的实验表明,Food-R1在食品相关任务上持续优于强基线。代码、模型权重和基准注释可在项目仓库中获得。

英文摘要

Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
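The step that gives GRPO its name can be sketched briefly: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group rather than against a learned value function. The snippet below is a generic illustration of that normalization only. It is not Food-R1's training code, the reward values are hypothetical, and using the population standard deviation with a small `eps` is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per response sampled for the same
    prompt. The result is positive for above-average responses and
    negative for below-average ones.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two sampled responses per prompt")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]
```

If every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason reward design matters for this family of methods.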

2606.04980 2026-06-04 cs.LG

AlphaQ: Calibration-Free Bit Allocation for Mixture-of-Experts Quantization

AlphaQ: 混合专家量化的免校准位分配

Wanqi Yang, Yuexiao Ma, Alexander Conzelmann, Xiawu Zheng, Michael W. Mahoney, T. Konstantin Rusch, Shiwei Liu

发表机构 * Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Tübingen AI Center(图宾根人工智能中心) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Xiamen University(厦门大学多媒体可信感知与高效计算重点实验室) International Computer Science Institute(国际计算机科学研究所) Lawrence Berkeley National Laboratory(劳伦斯伯克利国家实验室) University of California, Berkeley(加州大学伯克利分校) Liquid AI

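AI summary To address suboptimal bit allocation caused by reliance on calibration data in Mixture-of-Experts quantization, proposes AlphaQ, a calibration-free bit-allocation method based on Heavy-Tailed Self-Regularization theory that assigns bit-widths according to the heavy-tailedness of expert weight spectra and minimizes quantization error under a budget constraint, achieving near full-precision accuracy.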

Comments 28 pages, 11 figures

AI中文摘要

混合专家(MoE)架构通过稀疏专家激活扩展模型容量,但其部署仍受内存限制,因为所有专家权重必须驻留在内存中。混合精度量化通过为不同专家分配不同位宽,可以显著减少内存占用。然而,现有方法通常依赖校准数据来估计专家重要性并确定位分配。对于前沿的MoE大语言模型,原始训练数据(即真实训练分布)是专有的且不可访问。因此,校准集不可避免地成为不完美的替代品,这可能导致对专家利用率的错误估计和次优的位分配。受现代MoE模型中观察到的显著跨专家质量差异,以及重尾自正则化(HT-SR)理论在无需训练或测试数据的情况下成功预测神经网络模型质量的启发,我们提出了AlphaQ,一种用于MoE量化的免校准位分配方法。AlphaQ借鉴HT-SR理论,遵循一个简单原则:具有更重尾权重谱的专家通常训练得更好,因此应获得更高的位宽,而重尾结构较弱的专家可以更激进地量化。AlphaQ通过测量专家级别的谱重尾程度,并求解在全局位预算约束下最小化总量化误差的预算约束优化问题来实现这一原则。在多个MoE模型上,AlphaQ在匹配位预算下始终优于基于校准的基线方法。值得注意的是,在Qwen1.5-MoE上,AlphaQ在平均专家精度仅为3.5位的情况下实现了接近全精度的准确率,同时提供了超过4倍的内存压缩。我们的代码可在https://github.com/Superone77/AlphaQ获取。

英文摘要

Mixture-of-Experts (MoE) architectures scale model capacity through sparse expert activation, but their deployment remains memory-bound because all expert weights must reside in memory. Mixed-precision quantization can substantially reduce this footprint by assigning different bit-widths to different experts. Existing approaches, however, typically rely on calibration data to estimate expert importance and determine bit allocation. For frontier MoE LLMs, the original training data, and hence the true training distribution, is proprietary and inaccessible. As a result, calibration sets are inevitably imperfect surrogates, and this can misestimate expert utilization and lead to suboptimal bit allocation. Motivated by the substantial cross-expert quality variability observed in modern MoE models, and by the success of Heavy-Tailed Self-Regularization (HT-SR) theory at predicting neural network model quality without access to training or testing data, we propose AlphaQ, a calibration-free bit-allocation method for MoE quantization. AlphaQ draws on HT-SR theory and follows a simple principle: experts with more heavy-tailed weight spectra are typically better trained and hence should receive higher bit-widths, while experts with weaker heavy-tailed structure can be quantized more aggressively. AlphaQ operationalizes this principle by measuring expert-wise spectral heavy-tailedness and solving a budget-constrained optimization problem that minimizes total quantization error under a global bit-budget constraint. Across several MoE models, AlphaQ consistently outperforms calibration-based baselines under matched bit budgets. Notably, on Qwen1.5-MoE, AlphaQ achieves near full-precision accuracy with an average expert precision of only 3.5 bits, while delivering more than 4$\times$ memory compression. Our code is available at https://github.com/Superone77/AlphaQ.
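The allocation principle stated in this abstract (heavier-tailed experts get more bits, subject to a global budget) can be illustrated with a toy sketch. This is explicitly not AlphaQ's optimization, which the abstract describes as minimizing total quantization error; it is a simple greedy stand-in. It assumes a per-expert power-law exponent has already been fitted to each weight spectrum, with a smaller exponent meaning a heavier tail, and the function name, expert names, and exponent values are all hypothetical.

```python
def allocate_bits(alphas, avg_bit_budget, choices=(2, 3, 4)):
    """Greedy stand-in for a budget-constrained bit allocation.

    `alphas` maps expert name -> fitted power-law exponent of its weight
    spectrum (smaller alpha = heavier tail). Values are hypothetical.
    """
    choices = sorted(choices)
    bits = {name: choices[0] for name in alphas}
    budget = avg_bit_budget * len(alphas)
    if budget < sum(bits.values()):
        raise ValueError("budget is below the minimum bit-width")
    # Visit experts from heaviest-tailed to lightest-tailed and raise each
    # one to the largest bit-width that still fits the total budget.
    for name in sorted(alphas, key=alphas.get):
        for b in reversed(choices):
            if sum(bits.values()) - bits[name] + b <= budget:
                bits[name] = b
                break
    return bits
```

With four experts and an average budget of 3 bits, this sketch gives 4 bits to the two heaviest-tailed experts and 2 bits to the others. A real allocator would weigh the quantization error of each candidate bit-width instead of ranking by exponent alone.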

2606.04978 2026-06-04 cs.CL cs.CY econ.GN q-fin.EC

Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game

探究大语言模型风险决策中的结果层面相似性与机制层面一致性:来自圣彼得堡博弈的证据

Chensong Huang, Changyu Chen, Chenwei Lin, Hanjia Lyu, Xian Xu, Jiebo Luo

发表机构 * Fudan University(复旦大学) University of Rochester(罗切斯特大学)

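AI summary Through St. Petersburg game experiments, finds that LLMs exhibit human-like behavior at the outcome level in risk decisions but differ substantially from human decision-making at the mechanism level, suggesting that behavioral alignment may be only surface-level.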

AI中文摘要

大语言模型在风险决策任务中可能显得谨慎,但看似谨慎的输出并不一定表明其与人类决策机制对齐。我们以圣彼得堡博弈作为受控测试平台来研究这一区别,这是一个经典悖论,其中期望收益无限,但人类通常报告低且有限的支付意愿。我们评估了28个大语言模型,使用结构化的提示套件,包括原始博弈;控制决策变体,扰动截断、重复游戏、数字禀赋和职业身份;要求模型以人类决策者身份推理的人类视角提示;以及基础模型与其指令微调对应模型之间的配对比较。在原始博弈中,大多数模型生成有限出价,造成类人风险行为的表象。然而,这种结果层面的相似性掩盖了显著的机制层面差异。控制变体揭示,模型并未保持原始博弈中观察到的类人行为,而是常常转向条件性和计算性理性行为。人类线索提示和指令微调通常降低出价并减少一些可见的病理现象,但大多数机制层面的响应模式基本保持不变。这些发现表明,风险决策中的行为对齐可能是表面层次的:大语言模型可能产生类人风险决策,而不表现出与人类一致的机制。因此,对大语言模型决策的高风险评估应超越结果相似性,检查对齐是否由机制层面的一致性支持。

英文摘要

LLMs can appear cautious in risk decision-making tasks, yet cautious-looking outputs do not necessarily indicate alignment with human decision-making mechanisms. We investigate this distinction using the St. Petersburg game as a controlled testbed, a classical paradox in which the expected payoff is infinite, yet humans typically report low, finite willingness to pay. We evaluate 28 LLMs with a structured prompt suite that includes the original game; controlled decision variants that perturb truncation, repeated play, numeric endowment, and occupational identity; a human-perspective prompt that asks models to reason as human decision makers; and paired comparisons between base models and their instruction-tuned counterparts. In the original game, most models generate finite bids, creating the appearance of human-like risk behavior. However, this outcome-level resemblance masks substantial mechanism-level differences. The controlled variants reveal that rather than maintaining human-like behavior seen in the original game, models often shift to conditionally and computationally rational behavior. Human-cue prompting and instruction tuning often lower bids and reduce some visible pathologies, but most mechanism-level response patterns remain largely unchanged. These findings show that behavioral alignment in risk decision-making can be surface-level: LLMs may produce human-like risk decisions without exhibiting human-consistent mechanisms. High-stakes evaluations of LLM decision-making should therefore move beyond outcome similarity and examine whether the alignment is supported by mechanism-level consistency.
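The paradox behind this testbed is easy to verify numerically. In the standard formulation (assumed here, since the abstract does not spell out the payoff rule), a fair coin is tossed until the first head, and a head on toss k pays 2^k. Each toss count contributes exactly 1 to the expected payoff, so truncating the game at n tosses gives an expectation of n, which grows without bound. A player with logarithmic utility, by contrast, values the game at only 4. The function names below are illustrative.

```python
import math
from fractions import Fraction


def truncated_expected_payoff(max_tosses):
    """Expected payoff when the game is cut off after `max_tosses` tosses.

    A head on toss k (probability 1 / 2**k) pays 2**k. If no head shows
    within `max_tosses` tosses, the player receives nothing.
    """
    return sum(Fraction(1, 2**k) * 2**k for k in range(1, max_tosses + 1))


def log_utility_certainty_equivalent(max_tosses=200):
    """Certainty equivalent for a player with utility ln(payoff).

    Approximates the untruncated game; the tail beyond `max_tosses`
    is numerically negligible for the default value.
    """
    expected_log = sum(k * math.log(2) / 2**k for k in range(1, max_tosses + 1))
    return math.exp(expected_log)
```

This is why the truncation variant mentioned in the abstract is informative: a purely computational agent should scale its bid with the cutoff, whereas human bids typically stay low regardless.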

2606.04974 2026-06-04 cs.CL

SAID: Accelerating Diffusion-Based Language Models via Scaffold-Aware Iterative Decoding

SAID: 通过支架感知迭代解码加速基于扩散的语言模型

Na Li, Chengda Wang, Mingju Gao, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机学院)

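AI summary Proposes SAID, a framework that accelerates diffusion language models by reallocating denoising computation to scaffold tokens, and introduces CHLG to assign additional steps only to low-confidence tokens, achieving up to 9.1x speedup on LLaDA models while maintaining competitive performance.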

Comments Code: https://github.com/TH-AI-Lab-PKU/SAID

AI中文摘要

扩散大语言模型(DLLMs)通过迭代去噪具有双向上下文的损坏令牌序列,实现非自回归生成。尽管它们能够并行更新多个位置,但由于高质量生成需要大量去噪步骤,推理成本仍然很高。我们提出了SAID,一种支架感知迭代解码框架,通过跨令牌重新分配计算来加速DLLMs。SAID首先将去噪计算用于支架令牌以建立粗略的语义结构,然后用更少的步骤完成可预测的细节令牌。我们进一步将SAID适配到块级扩散解码,并引入了置信度分层生成(CHLG),仅为低置信度令牌分配额外的步骤。在LLaDA-8B和LLaDA 1.5上的数学、编码和知识基准实验表明,SAID显著加速了DLLM推理,最高加速比达9.1倍,同时保持了竞争性能。我们的代码公开在:https://github.com/TH-AI-Lab-PKU/SAID。

英文摘要

Diffusion large language models (DLLMs) enable non-autoregressive generation by iteratively denoising corrupted token sequences with bidirectional context. Despite their ability to update multiple positions in parallel, inference remains costly due to the many denoising steps required for high-quality generation. We propose SAID, a Scaffold-Aware Iterative Decoding framework that accelerates DLLMs by reallocating computation across tokens. SAID first spends denoising computation on scaffold tokens to establish the coarse semantic structure, and then completes predictable detail tokens with fewer steps. We further adapt SAID to block-wise diffusion decoding and introduce Confidence-Hierarchical Layered Generation (CHLG), which assigns additional steps only to low-confidence tokens. Experiments on LLaDA-8B and LLaDA 1.5 across math, coding, and knowledge benchmarks show that SAID significantly accelerates DLLM inference with a maximum speedup of 9.1x while maintaining competitive performance. Our code is publicly available: https://github.com/TH-AI-Lab-PKU/SAID.

2606.04971 2026-06-04 cs.LG cs.DB

Be Fair! Can Machine Learning Engineering Agents Adhere to Fairness Constraints?

要公平!机器学习工程代理能否遵守公平性约束?

Anna Richter, Julia Stoyanovich, Sebastian Schelter

发表机构 * BIFOLD & TU Berlin(BIFOLD与柏林工业大学) New York University(纽约大学)

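AI summary Studies whether machine learning engineering agents can satisfy fairness constraints in automated ML pipeline development; exploratory melanoma classification experiments show that agent-generated pipelines fall below manually designed baselines in both predictive quality and fairness.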

AI中文摘要

机器学习工程(MLE)代理承诺从原始数据和自然语言指令自动化端到端ML管道开发,可能使非技术领域专家也能使用ML。然而,在敏感和受监管的领域,这种抽象造成了责任差距:最终用户可能无法了解影响正确性、鲁棒性、公平性和法规遵从性的设计选择。我们认为现有基准不足以评估MLE代理能否安全应用于此类环境。我们提出了以责任为中心的评估框架的期望,并进行了黑色素瘤分类的探索性研究,重点关注跨肤色公平性作为责任约束。在评估两个最近的MLE代理时,我们发现代理生成的管道在预测质量和公平性方面表现出高方差,并且始终低于手动设计的基线,尽管使用了面向公平性的提示。这些初步结果表明,需要进一步研究重新设计MLE代理,以允许人类指导搜索过程并可靠地评估生成的ML管道的合规性和质量。

英文摘要

Machine learning engineering (MLE) agents promise to automate end-to-end ML pipeline development from raw data and natural language instructions, potentially making ML accessible to non-technical domain experts. However, in sensitive and regulated domains, this abstraction creates a responsibility gap: end-users may lack visibility into design choices that affect correctness, robustness, fairness, and regulatory compliance. We argue that existing benchmarks are insufficient to assess whether MLE agents can be safely applied in such settings. We propose desiderata for a responsibility-centered evaluation framework and conduct an exploratory study on melanoma classification, focusing on fairness across skin tones as a responsibility constraint. When evaluating two recent MLE agents, we find that agent-generated pipelines show high variance and consistently underperform manually designed baselines in both predictive quality and fairness, despite fairness-oriented prompts. These preliminary results suggest that further research is needed towards redesigning MLE agents to allow humans to guide the search process and reliably assess the compliance and quality of the generated ML pipelines.